# Learning Invariant Causal Mechanism from Vision-Language Models

Zeen Song * 1 2, Siyu Zhao * 2, Xingyu Zhang * 1 2, Jiangmeng Li 1, Changwen Zheng 1, Wenwen Qiang 1

## Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism involving solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to the invariant factors, which can be estimated using interventional data. Additionally, we provide a condition that guarantees a low OOD risk for the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications. The source code is available at https://github.com/ZeenSong/CLIP-ICM.

*Equal contribution. 1 Institute of Software, Chinese Academy of Sciences, Beijing, China. 2 University of the Chinese Academy of Sciences. Correspondence to: Wenwen Qiang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## 1. Introduction

As one of the most successful large-scale pre-trained vision-language models, Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021) has garnered significant attention. Trained on 400 million diverse image-text pairs, CLIP achieves robust cross-modal alignment without task-specific supervision. This enables direct generalization to unseen tasks by measuring semantic similarity between images and textual category descriptions (Radford et al., 2021). Notably, CLIP performs well in out-of-distribution (OOD) scenarios, effectively adapting to domain variations and recognizing previously unseen categories. However, this ability of CLIP is task-independent. In many real-world applications, CLIP requires fine-tuning to adapt to specific tasks (Gao et al., 2023; Zhou et al., 2022a; Shu et al., 2023; Zhang et al., 2023b).

When fine-tuning CLIP for downstream tasks, a critical question arises: can CLIP maintain its strong capabilities when faced with test data that differs from the training distribution? More importantly, can it still perform well when encountering new classes not seen during training? To address this, we conduct an experiment on the Terra Incognita dataset (Beery et al., 2018), which contains species data collected from diverse real-world environments. The results in Table 1 show that fine-tuning CLIP in the training domains does not necessarily lead to good performance in the test domain. Furthermore, when encountering new classes not observed in the training domains, fine-tuning on the training domains can harm the zero-shot ability of CLIP. To address the above problem, we find that this inconsistent-prediction issue can be well explained from a causal perspective.
Inspired by (Pearl, 2009; Pearl & Bareinboim, 2014; Schölkopf et al., 2021), we first propose a Structural Causal Model (SCM) to capture the causal processes underlying the OOD predictions of CLIP. In this causal model, predictions rely on two types of latent factors, which are assumed to generate the image data: (1) environment-invariant factors that remain unchanged across different environments and (2) environment-variant factors that vary with the environment. By analyzing the causal mechanisms in the SCM, we observe that if predictions rely on both invariant and variant factors, the causal mechanism in the training environment differs from that in the test environment. However, when predictions are based solely on invariant factors, the prediction mechanism remains consistent across environments. This raises an intriguing question: could we design a predictor that relies only on the invariant factors?

A key challenge in obtaining such an invariant predictor is that it is unclear which parts of the representations extracted by CLIP correspond to the invariant factors (Hyvärinen & Pajunen, 1999; Locatello et al., 2019). By analyzing the identifiability of CLIP, we first demonstrate that the representations learned by CLIP can be viewed as a linear combination of invariant and variant factors. Under this condition, we prove that, if interventional data (refer to Appendix H.1 and Section 6 for details) are available, it is possible to derive a linear projection matrix that maps the representations of CLIP to the invariant factors. The learned projection matrix maps the image and text representations of CLIP into the same invariant space, in which it becomes possible to perform invariant prediction. We further analyze the OOD risk of this invariant predictor compared to a regular predictor, and provide conditions under which the invariant predictor indeed achieves a lower OOD risk.

Guided by the above theoretical analysis, we propose a novel modeling framework called the Invariant Causal Mechanism of CLIP (CLIP-ICM). This framework consists of three stages: (1) collect interventional data; (2) estimate a linear projection to the invariant subspace; (3) perform invariant prediction in the invariant subspace. Under the CLIP-ICM framework, we propose specific methods to obtain interventional data using either image or text data. Additionally, we propose a learning objective for estimating the projection matrix. Notably, our method does not require retraining the backbone network of CLIP. We evaluate the proposed CLIP-ICM on OOD generalization datasets, including DomainBed (Gulrajani & Lopez-Paz) and variants of ImageNet (Recht et al., 2019; Hendrycks et al., 2021b;a; Wang et al., 2019). Experimental results demonstrate the outstanding effectiveness of our approach. Considering its low computational cost and significant performance improvement, our method is both simple and effective.

Our contributions can be summarized as follows: 1) We identify and demonstrate that CLIP exhibits inconsistent performance between the training and test domains when fine-tuned in OOD generalization scenarios. Through an in-depth causal analysis, we demonstrate that it is possible to achieve consistent predictions using only invariant factors. 2) Through an analysis of the identifiability of CLIP, we theoretically prove the existence of a linear mapping from CLIP embeddings to the invariant factors. This mapping can be estimated using interventional data.
Additionally, the OOD risk analysis provides a condition that guarantees a lower OOD risk for the invariant predictor. 3) We propose the CLIP-ICM framework, which includes collecting interventional data using either image or text data, estimating the linear projection, and performing invariant prediction. Extensive experiments across various OOD scenarios demonstrate that CLIP-ICM significantly enhances the performance of CLIP. Additionally, a comprehensive set of ablation studies validates the effectiveness of our approach.

## 2. Related Work

**Vision-Language Pre-training.** Incorporating language with the visual modality has been a long-standing issue, as extensively researched in previous works (Frome et al., 2013; Socher et al., 2013; Norouzi et al., 2014; Elhoseiny et al., 2013). Recent years have seen significant success in multi-modal pre-training models, spearheaded by CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), with the aim of enabling cross-modal learning by mapping the language and visual modalities into the same embedding space. Notably, researchers have turned their focus towards fine-tuning CLIP on downstream tasks (Gao et al., 2023; Zhou et al., 2022a;b; Shu et al., 2023). In our work, we delve into the reasons behind the poor OOD generalization ability of CLIP from a causal perspective, and propose CLIP-ICM as a solution to address this issue.

**Causal Representation Learning.** The latent factor hypothesis posits that observational data are generated by underlying latent factors, with causal representation learning aiming to identify these factors, a challenge akin to Independent Component Analysis (ICA) (Bishop, 1998; Bengio et al., 2013). While linear ICA is solvable (Hyvärinen & Oja, 2000), nonlinear ICA requires inductive biases such as auxiliary labels, sparsity constraints, or restricted function classes (Hyvärinen & Pajunen, 1999; Locatello et al., 2019; Schölkopf et al., 2021). Recent work has shifted from purely observational settings (Roeder et al., 2021) to leveraging interventional data (Ahuja et al., 2023; Squires et al., 2023). Aligning with this trend, we propose a method that uses interventional data to identify invariant features in CLIP, offering a simple yet effective solution.

**Out-of-Distribution Generalization.** It is evident in many cases that the performance of deep learning methods is weakened when applied to different distributions (Beery et al., 2018; Taori et al., 2020). Studies have been conducted to address OOD generalization issues (Arjovsky et al., 2020; Ahuja et al., 2020; Gulrajani & Lopez-Paz; Miller et al., 2021; Abbe et al., 2023; Chen et al.). One important branch of work aims to find invariant causal mechanisms across domains, inspired by the invariance principle from causality (Arjovsky et al., 2020; Peters et al., 2016; Suter et al., 2019; Ahuja et al., 2020; 2021; Chen et al.). Additionally, some studies incorporate SCMs when analyzing the OOD problem (Robey et al., 2021; Li et al., 2024). Vision-language pre-trained models, such as CLIP (Radford et al., 2021), exhibit impressive performance across various domains. However, recent works emphasize that adapting CLIP with task-specific data often comes at the cost of OOD generalization ability (Shu et al., 2023; Gao et al., 2023; Pham et al., 2023). We also provide a detailed analysis of some prior works in Appendix L.
## 3. Problem Setting

### 3.1. Contrastive Language-Image Pre-training

We begin by revisiting the framework of CLIP (Radford et al., 2021).¹ Let $x \in \mathcal{X}$ represent an arbitrary image, and $t \in \mathcal{T}$ denote a text sequence describing this image, forming an image-text pair $(x, t)$. Here, $\mathcal{X}$ and $\mathcal{T}$ refer to the image and text spaces, respectively. CLIP processes these inputs through two pre-trained encoders: an image encoder $f_I : \mathcal{X} \to \hat{\mathcal{Z}}$ and a text encoder $f_T : \mathcal{T} \to \hat{\mathcal{Z}}$. Both encoders map their respective inputs into a shared embedding space $\hat{\mathcal{Z}} \subseteq \mathbb{R}^D$, where the embeddings of the image-text pair $(x, t)$ are aligned. Here, $D$ denotes the dimensionality of the embedding space.

¹The definition of all notations is provided in Appendix A.

A specific characteristic of CLIP is zero-shot prediction. The goal of zero-shot prediction is to use a predictor $h : \mathcal{X} \to \mathcal{Y}$ to map an input image $x \in \mathcal{X}$ to a predicted category label $\hat{y} \in \mathcal{Y}$, where $\mathcal{Y}$ denotes the set of possible class labels. For each class $c \in \mathcal{Y}$, a corresponding text description $t_c$ is defined. For instance, for a class $c$, the description may be something like "a photo of a [CLASS]", where [CLASS] is replaced by the name of class $c$. CLIP encodes both the image $x$ and the text description $t_c$ of each class into the shared embedding space $\hat{\mathcal{Z}} \subseteq \mathbb{R}^D$. Then, it computes the similarity between the image embedding $f_I(x)$ and the embedding of the text description of each class, $f_T(t_c)$. The predictor $h$ is defined as:

$$h(x) = \arg\max_{c \in \mathcal{Y}} P(c \mid x), \quad \text{s.t.} \quad P(c \mid x) = \frac{\exp\left(S(f_I(x), f_T(t_c))\right)}{\sum_{c' \in \mathcal{Y}} \exp\left(S(f_I(x), f_T(t_{c'}))\right)}. \tag{1}$$

Here, $P(c \mid x)$ is the probability of class $c$ given $x$, and $S(f_I(x), f_T(t_c))$ denotes the cosine similarity between $f_I(x)$ and $f_T(t_c)$. The predicted label $\hat{y}$ is the one associated with the text description that has the highest similarity to the image embedding.
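For concreteness, the following is a minimal sketch of the zero-shot predictor in Equation (1), operating on embeddings that are assumed to have been pre-computed with CLIP's encoders. The function name, the `temperature` scaling (CLIP applies a learned temperature in practice, which Equation (1) omits), and the random inputs in the usage example are illustrative assumptions, not part of the original method.

```python
import numpy as np

def zero_shot_predict(image_emb: np.ndarray, text_embs: np.ndarray, temperature: float = 0.01):
    """Zero-shot prediction in the spirit of Equation (1).

    image_emb: (D,) embedding f_I(x) of one image.
    text_embs: (C, D) embeddings f_T(t_c) of the C class descriptions.
    Returns the predicted class index and the class probabilities.
    """
    # Cosine similarity S(f_I(x), f_T(t_c)) via L2 normalization.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                          # (C,)
    # Softmax over classes, with an optional temperature as used in practice.
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Hypothetical usage with random vectors standing in for f_I(x) and {f_T(t_c)}:
rng = np.random.default_rng(0)
pred, probs = zero_shot_predict(rng.normal(size=512), rng.normal(size=(3, 512)))
```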
### 3.2. Out-of-Distribution Generalization

Next, we present a unified framework for adapting CLIP to OOD generalization scenarios. In this framework, each distinct scenario (e.g., varying lighting, styles, or camera angles) is treated as an environment $e \in \mathcal{E}_{all}$, where $\mathcal{E}_{all}$ represents the set of all possible environments, and each environment corresponds to a unique data distribution $P^e(x, y)$. The goal of OOD generalization is to learn a predictor $h$ from training environments $\mathcal{E}_{tr} \subset \mathcal{E}_{all}$ that performs well in unseen environments. Achieving this involves addressing two challenges that often occur together in practice: domain shift and open-class scenarios. Here, domain shift refers to the case where the distribution of $x$ in the test environment differs from the distribution of $x$ in the training environments², while open-class scenarios involve previously unseen classes appearing at test time. Although the zero-shot capability of CLIP excels in handling open-class scenarios, maintaining this ability under domain shift presents a challenge.

²We primarily refer to the case in standard domain generalization tasks.

Several established methods exist for fine-tuning CLIP to acquire such a predictor $h$ for tasks in OOD scenarios: (1) Linear probe: a linear classifier is trained on the fixed output of $f_I$. (2) Learnable prompt (Zhou et al., 2022b;a; Zhang et al., 2024): the fixed text descriptions for each class are replaced with learnable parameters. (3) Adapter (Gao et al., 2023): both $f_I$ and $f_T$ are frozen, and a multilayer perceptron (MLP) is trained on their outputs. (4) Fine-tuning the image encoder (Shu et al., 2023): all parameters of $f_I$ are optimized, while $f_T$ is fixed. Despite these differences, all methods share a common procedure when fine-tuned in OOD scenarios. A pre-trained CLIP model is taken as the backbone, and some task-specific parameters $\theta$, such as classifier weights or prompt vectors, are introduced. Data from the training environments $\mathcal{E}_{tr} \subset \mathcal{E}_{all}$ are then collected. During training, the model parameters are optimized to reduce classification errors across the aggregated training distribution. After this phase, the model is evaluated on the test environments $\mathcal{E}_{all} \setminus \mathcal{E}_{tr}$ using the updated parameters.

## 4. Motivation Experiment

In this section, we evaluate the performance of CLIP under OOD generalization using the Terra Incognita dataset (Beery et al., 2018), which contains identical object categories of wildlife species across diverse, in-the-wild environmental conditions. We first examine only the domain shift scenario, where we adopt a leave-one-out protocol and train a linear probe on top of the frozen CLIP image encoder. Specifically, for a given target domain, a linear classifier is trained on frozen CLIP image embeddings from all other domains and tested on the held-out domain to assess how well the model handles shifts in distribution. Next, we examine a case combining domain shift and open classes to assess whether the strong zero-shot capabilities of CLIP remain after fine-tuning in OOD settings. We divide the categories into base and new classes and apply the leave-one-out protocol. The image encoder is fine-tuned on base-class data from the training domains, while the text encoder remains frozen. The model is then evaluated on both base and new classes in the target domain to assess whether the zero-shot ability of CLIP remains after fine-tuning.

Table 1. The performance of CLIP on the Terra Incognita dataset in accuracy (%). L100, L38, L43, and L46 represent different environments. In the Linear-Probe case, the values in parentheses represent the accuracy achieved by directly fine-tuning with the linear probe on the target domain.

Domain shift:

| Method | L100 | L38 | L43 | L46 | Avg |
|---|---|---|---|---|---|
| Linear-Probe | 73.6 (90.9) | 58.3 (86.3) | 61.0 (76.0) | 47.8 (78.9) | 60.2 (83.0) |

Open class under domain shift:

| Split | Method | L100 | L38 | L43 | L46 | Avg |
|---|---|---|---|---|---|---|
| Base | Zero-shot | 56.4 | 23.2 | 30.4 | 25.8 | 33.9 |
| Base | Fine-tune | 75.6 | 58.5 | 65.2 | 55.1 | 63.6 |
| New | Zero-shot | 47.6 | 17.6 | 35.2 | 37.4 | 34.5 |
| New | Fine-tune | 36.7 | 11.2 | 25.1 | 25.5 | 24.6 |
| Total | Zero-shot | 52.0 | 20.4 | 32.8 | 31.6 | 34.2 |
| Total | Fine-tune | 56.2 | 34.9 | 45.2 | 40.3 | 44.2 |

From the results in Table 1, we can draw two important conclusions. First, in the domain shift scenario, the performance of fine-tuning is unsatisfactory. For instance, in the L46 domain, the accuracy achieved using the leave-one-out protocol (47.8%) is lower by 31.1% than that achieved by directly fine-tuning on this domain (78.9%). This indicates that training on other domains does not ensure good performance on the L46 domain. Second, in the open-class scenario, the zero-shot performance remains nearly consistent across base and new classes. While the fine-tuned model outperforms zero-shot predictions on base classes, it performs worse on new classes. These results suggest that naively fine-tuning the CLIP model in OOD scenarios can harm its strong zero-shot capabilities. This raises a critical question: what causes the discrepancy between training and testing domains in the OOD generalization of CLIP, and how can we preserve the strong zero-shot capacity of CLIP for unseen classes while adapting to OOD scenarios?
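As a reference for the domain-shift protocol above, here is a minimal sketch of the leave-one-out linear probe, assuming frozen CLIP image embeddings have already been extracted for each domain; the function and variable names are ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def leave_one_out_linear_probe(embs_by_domain, labels_by_domain, target):
    """Train a linear probe on frozen CLIP image embeddings from all domains
    except `target`, then report accuracy on the held-out `target` domain.

    embs_by_domain / labels_by_domain: dicts mapping a domain name (e.g. "L100")
    to an (N, D) embedding array / (N,) label array, assumed to be pre-computed
    with a frozen CLIP image encoder.
    """
    train_x = np.concatenate([e for d, e in embs_by_domain.items() if d != target])
    train_y = np.concatenate([y for d, y in labels_by_domain.items() if d != target])
    probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
    return probe.score(embs_by_domain[target], labels_by_domain[target])
```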
## 5. Theoretical Analysis

### 5.1. Causal Analysis

To further investigate the reasons behind the unsatisfactory OOD generalization of CLIP, we propose to analyze it from a causal perspective. During the analysis, we adopt the latent factor hypothesis, which assumes that a set of latent factors generates the observed data. Based on this, we model both the data-generating and prediction processes using a Structural Causal Model (SCM) in Figure 1.³

³Please refer to Definition C.1 for the definition of the SCM and the causal mechanism.

Let $X \sim P(x)$ and $Y \sim P(y)$ be the random variables for images and class labels, defined on $\mathcal{X}$ and $\mathcal{Y}$, respectively. A selection variable $E \sim P(e)$, defined on $\mathcal{E}_{all}$, represents the environment (Pearl & Bareinboim, 2014). The latent variable $Z \sim P(z)$, defined on $\mathcal{Z} = \mathcal{Z}_{inv} \times \mathcal{Z}_{var}$, consists of two components: the invariant factor $Z_{inv} \sim P(z_{inv})$ and the variant factor $Z_{var} \sim P(z_{var})$.

[Figure 1. The SCMs for (a) the data-generating process and (b) the prediction process. The square denotes the selection variable; see Appendix C for details.]

The SCM in Figure 1(a) illustrates the data-generating process. In this SCM, the edges $Y \to Z_{inv}$ and $Y \to Z_{var}$ indicate that the class label $Y$ influences both $Z_{inv}$ and $Z_{var}$. The edge $E \to Z_{var}$ shows that $Z_{var}$ is affected by the environment $E$. Finally, the edges $Z_{inv} \to X$ and $Z_{var} \to X$ represent that the image $X$ is generated from $Z_{inv}$ and $Z_{var}$. For example, consider the class label "bird" ($Y$). The wing shape and beak shape ($Z_{inv}$) do not change across different environments, whereas feather color ($Z_{var}$) appears differently in day or night ($E$), resulting in varying images of birds ($X$) across environments.

The SCM in Figure 1(b) depicts the prediction process, which can be considered the inverse of the data-generating process (Bengio et al., 2013; Schölkopf et al., 2021; Zimmermann et al., 2021). In this SCM, the edges $X \to Z_{var}$ and $X \to Z_{inv}$ denote the representation learning process, and the edges $Z_{var} \to Y$ and $Z_{inv} \to Y$ represent the prediction process. In OOD prediction scenarios of CLIP, the processes $X \to Z_{inv}$ and $X \to Z_{var}$ correspond to the image being passed through the image encoder to obtain its feature representation, i.e., $f_I(x) = z$, where $z = [z_{inv}, z_{var}]^T$.⁴

[Figure 2. An example of a classifier defined on $\mathcal{Z} = \mathcal{Z}_{inv} \times \mathcal{Z}_{var}$ when applied to different environments. The dashed lines represent the decision boundaries and the solid arrows depict the normal vectors of these boundaries; $\mathcal{Y}$ is the set of labels. (a) illustrates a classifier that is trained and tested in the same environment. (b) demonstrates applying the classifier from (a) to another environment $E = e'$; the change in $E$ only affects the distribution of $Z_{var}$. (c) demonstrates the case where samples and classifiers are projected to the $\mathcal{Z}_{inv}$ space.]
The processes $Z_{inv} \to Y$ and $Z_{var} \to Y$ involve comparing the image embedding $z$ with the embeddings of the text descriptions $f_T(t_c)$ for each class $c \in \mathcal{Y}$; the predicted label is derived based on the similarity, as shown in Equation (1).

⁴For simplicity, this process assumes an ideal scenario, which we discuss further in Section 5.2.

In an SCM, each edge represents a specific causal mechanism. Our focus is the causal mechanism in the OOD prediction scenario. Specifically, we investigate whether a predictor learned from training environments $\mathcal{E}_{tr} \subset \mathcal{E}_{all}$ can still perform well in an unknown test environment $e' \in \mathcal{E}_{all}$. The following proposition outlines how the causal mechanisms change under environmental shifts:

**Proposition 5.1.** Let $P'(\cdot) := P(\cdot \mid e')$ denote the distribution in the test environment $E = e'$. The causal mechanisms are formalized through interventional distributions, and we focus on the predictions $P(y \mid do(\cdot))$, where $do(\cdot)$ represents an intervention. The causal mechanism based on the full image satisfies $P'(y \mid do(x)) \neq P(y \mid do(x))$ in general. When using only $Z_{inv}$, the causal mechanism remains unchanged: $P'(y \mid do(z_{inv})) = P(y \mid do(z_{inv}))$.

The detailed proof is provided in Appendix D. Proposition 5.1 shows that the causal mechanism $P(y \mid do(x))$, relying on both invariant and variant factors, varies across environments, making a predictor trained in one domain unusable in another. In contrast, the causal mechanism $P(y \mid do(z_{inv}))$, based only on $Z_{inv}$, remains stable across environments, ensuring that a predictor trained in one domain can generalize to others.

To illustrate this process, Figure 2 provides an example. In the figure, we plot the image samples in the latent space $\mathcal{Z} = \mathcal{Z}_{inv} \times \mathcal{Z}_{var}$. Samples from different classes are plotted in different colors, and the arrows in varying colors represent the embeddings of the text descriptions for each class. Figure 2(a) shows the sample distribution in the training environment $e$, while Figure 2(b) depicts the distribution in the test environment $e'$. In Figure 2(a), the samples of a given color are similar to the corresponding text description features, ensuring correct predictions in the training environment.⁵ However, in Figure 2(b), the distribution of samples shifts along the $\mathcal{Z}_{var}$ dimension due to environmental changes. This shift causes samples to move away from their corresponding text embeddings, leading to misclassification. In Figure 2(c), both the samples and the text embeddings are projected into the $\mathcal{Z}_{inv}$ subspace. Here, the environmental shift no longer affects the prediction, ensuring consistent performance in both the training and test environments.

⁵This remains true regardless of the fine-tuning strategy, even in open-class scenarios. For further discussion, refer to Appendix B.
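To make the effect behind Proposition 5.1 and Figure 2 tangible, the following toy simulation (our own illustrative construction, not from the paper) samples data from an SCM of the form in Figure 1, trains a logistic-regression classifier in one environment, and evaluates it in an environment where only the distribution of $z_{var}$ has shifted. All dimensions and distributions here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, env_shift):
    """Toy SCM: y -> z_inv, y -> z_var, e -> z_var, (z_inv, z_var) -> x.
    `env_shift` plays the role of the environment E acting on z_var."""
    y = rng.integers(0, 2, size=n)
    z_inv = y[:, None] * 2.0 + rng.normal(scale=0.5, size=(n, 2))                 # stable across e
    z_var = y[:, None] * 1.0 + env_shift + rng.normal(scale=0.5, size=(n, 2))     # shifts with e
    x = np.concatenate([z_inv, z_var], axis=1)   # the mixing g is the identity for simplicity
    return x, y

x_tr, y_tr = sample(2000, env_shift=0.0)   # training environment e
x_te, y_te = sample(2000, env_shift=3.0)   # test environment e' (z_var shifted)

full = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)        # uses z_inv and z_var
inv = LogisticRegression(max_iter=1000).fit(x_tr[:, :2], y_tr)  # uses z_inv only

print("full predictor:     ", full.score(x_tr, y_tr), full.score(x_te, y_te))
print("invariant predictor:", inv.score(x_tr[:, :2], y_tr), inv.score(x_te[:, :2], y_te))
```

With the shift applied only to $z_{var}$, the classifier that uses all coordinates typically loses substantial accuracy in the test environment, while the one restricted to the invariant coordinates is essentially unaffected.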
### 5.2. Identifiability Analysis

In the previous section, we argued that if the prediction is performed in the space $\mathcal{Z}_{inv}$, the results in the test environment remain consistent with those in the training environments. However, in practice, it is unclear which part of the image representation obtained by CLIP corresponds to $\mathcal{Z}_{inv}$. The task of finding invariant factors mirrors the well-established problem of latent factor identification (Schölkopf et al., 2021; Hyvärinen & Oja, 2000). The observational data are assumed to be generated from the latent factor $Z$ through an unknown, non-linear process $g$. Identifying the latent factors means finding a function $f : \mathcal{X} \to \mathcal{Z}$ such that $f$ is the inverse of the data-generating process $g : \mathcal{Z} \to \mathcal{X}$, i.e., $f = g^{-1}$.

In this section, we theoretically prove that it is possible to find a linear mapping from the CLIP embedding space to $\mathcal{Z}_{inv}$. We aim to explore the ability of CLIP to identify latent variables. To begin with, we assume the data-generating process $g : \mathcal{Z} \to \mathcal{X}$ is injective. This assumption implies that $g$ is invertible and that different latent factors generate different images. Based on this generation process, we define the ideal image encoder $f_I^* : \mathcal{X} \to \mathcal{Z}$ as the inverse of $g$, i.e., $f_I^*(x) = g^{-1}(x)$. Furthermore, we define $f_T^*$ as the corresponding text encoder for $f_I^*$: the outputs of pairs $(x, t)$ through $f_T^*$ and $f_I^*$ are aligned in the shared embedding space. For the actual encoders $f_I$ and $f_T$ in CLIP, we prove in Proposition 5.3 that if Condition 5.2 is satisfied, $f_I$ identifies $Z$ up to an invertible linear transformation.

**Condition 5.2.** For an image encoder $f_I$ and a text encoder $f_T$ with output dimension $D$, for any $x$ sampled from $P(x)$, there exist $D + 1$ distinct text description pairs $(t_a, t_b)$ satisfying:

$$\frac{\exp\left(f_I(x)^T f_T(t_a)\right)}{\exp\left(f_I(x)^T f_T(t_b)\right)} = \frac{\exp\left(f_I^*(x)^T f_T^*(t_a)\right)}{\exp\left(f_I^*(x)^T f_T^*(t_b)\right)}. \tag{2}$$

**Proposition 5.3.** For an image encoder $f_I : \mathcal{X} \to \hat{\mathcal{Z}}$ and its corresponding text encoder $f_T : \mathcal{T} \to \hat{\mathcal{Z}}$, where $\hat{\mathcal{Z}} \subseteq \mathbb{R}^D$ is the embedding space, if Condition 5.2 is satisfied, then $f_I(x)$ identifies $Z$ up to an invertible linear transformation $A$, i.e., $f_I(x) = A g^{-1}(x) = A z$.

The proof is provided in Appendix E.⁶ Condition 5.2 ensures both consistency and diversity. Consistency is achieved because the similarity between image embeddings and text embeddings produced by $f_I$ and $f_T$ aligns with that produced by the ideal encoders $f_I^*$ and $f_T^*$. Diversity is ensured through the existence of sufficiently many pairs $(t_a, t_b)$, which guarantees the invertibility of the matrix $A$. Consequently, satisfying Proposition 5.3 ensures that predictions based on similarity remain consistent with the ideal setting, allowing CLIP to achieve accurate predictions. The remarkable performance of CLIP, supported by its large-scale training data, suggests that it can satisfy this condition in practice.

⁶This conclusion aligns with results established in prior works (Roeder et al., 2021; Ahuja et al., 2022; Hyvarinen & Morioka, 2016; Zimmermann et al., 2021), and Proposition 5.3 demonstrates this result within our specific scenario.

Proposition 5.3 suggests that the output representation of CLIP is a linear combination of the true latent factors, encompassing both invariant and variant factors. As a result, changes in the environment lead to shifts in all dimensions of the output of CLIP. As we discussed in Section 5.1, this can lead to incorrect predictions when the text descriptions remain unchanged. Fortunately, Proposition 5.3 indicates that there exists a linear mapping from $\hat{\mathcal{Z}}$ to $\mathcal{Z}_{inv}$. Consider an observed representation $\hat{z} = f_I(x)$. Proposition 5.3 suggests that $\hat{z}$ is a linear transformation of the true latent vector, i.e., $\hat{z} = A z$, where $A$ is an invertible matrix and $z = [z_{inv}, z_{var}]^T$ is the true latent vector. The true latent vector can be recovered by applying the inverse of $A$, so we have $z = A^{-1}\hat{z}$. We can decompose $A^{-1}$ into two blocks, $A^{-1} = [A_{inv}^T, A_{var}^T]^T$, and $z_{inv}$ can be obtained as $z_{inv} = A_{inv}\hat{z}$. Since $A$ is a fixed, invertible matrix, the above derivation shows that there always exists a linear mapping $A_{inv}$ that maps $\hat{z}$ to the true invariant factor $z_{inv}$. However, since both $A$ and the true latent factors $z$ are unobservable, it is infeasible to obtain the matrix $A_{inv}$ directly from the observed representation $\hat{z}$ alone.
However, by using interventional data, in which certain latent factors are fixed, as outlined in Condition 5.4, we can prove that it becomes possible to estimate $A_{inv}$.

**Condition 5.4.** Let $P^{do(z_{inv})}(x)$ denote an interventional distribution of $X$. The invariant factors of the images from this distribution are required to take a fixed value, i.e., $z_{inv} = z^*$. Here, $do(\cdot)$ denotes an intervention flag. This condition requires that sampling from $P^{do(z_{inv})}(x)$ is available.

**Proposition 5.5.** For an encoder $f_I$ that satisfies Proposition 5.3, if Condition 5.4 is satisfied and sampling from $P^{do(z_{inv})}(x)$ is possible, then for any pair $(x_1^{do(z_{inv})}, x_2^{do(z_{inv})})$ sampled from $P^{do(z_{inv})}(x)$, the mapping $A_{inv}$ satisfies:

$$A_{inv}\left(f_I(x_1^{do(z_{inv})}) - f_I(x_2^{do(z_{inv})})\right) = 0. \tag{3}$$

The proof is detailed in Appendix F. Proposition 5.5 provides the theoretical foundation for extracting $z_{inv}$ and performing prediction in the $\mathcal{Z}_{inv}$ space. Once Condition 5.4 is satisfied, interventional data can be obtained. By repeatedly sampling from this interventional distribution, it becomes possible to estimate a matrix that satisfies Equation (3). This matrix then maps the observable data into the $\mathcal{Z}_{inv}$ space, allowing invariant prediction across environments.

### 5.3. OOD Generalization Analysis

In the previous sections, we proved the feasibility of projecting the representations of CLIP into an invariant subspace and performing predictions within that subspace. In this section, we provide theoretical guarantees for the OOD generalization performance of this approach. One way to assess OOD generalization is to measure performance in the worst-case environment. Formally, this is captured by the OOD risk:

$$R_{OOD}(h) = \max_{e \in \mathcal{E}_{all}} R^e(h), \tag{4}$$

where $R^e(h)$ denotes the risk of $h$ in environment $e$ and can be calculated as:

$$R^e(h) = \mathbb{E}_{P^e(x,y)}\left[\mathbb{I}(h(x) \neq y)\right], \tag{5}$$

where $\mathbb{I}(h(x) \neq y)$ is the indicator function: it equals 1 if the prediction of $h$ is incorrect and 0 otherwise.

We consider two types of predictors: the conventional predictor $h$ and the invariant predictor $h_{inv}$. The definition of the conventional predictor is provided in Equation (1), while the invariant predictor is defined as follows:

$$h_{inv}(x) = \arg\max_{c \in \mathcal{Y}} P(c \mid x, A_{inv}), \tag{6}$$

where $P(c \mid x, A_{inv})$ is the probability of class $c$ computed in the invariant subspace. It is obtained by projecting with $A_{inv}$ and applying the softmax function:

$$P(c \mid x, A_{inv}) = \frac{\exp\left(S(A_{inv} f_I(x), A_{inv} f_T(t_c))\right)}{\sum_{c' \in \mathcal{Y}} \exp\left(S(A_{inv} f_I(x), A_{inv} f_T(t_{c'}))\right)}. \tag{7}$$

By investigating the OOD risk of these predictors, we present the following theorem:

**Theorem 5.6.** Let $\mathcal{H}$ be a hypothesis class over $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{E}_{all}$ denote a set of environments with distributions $\{P^e\}_{e \in \mathcal{E}_{all}}$ over $\mathcal{X} \times \mathcal{Y}$. Let $h_{inv} \in \mathcal{H}$ be a predictor relying solely on $Z_{inv}$, and $h \in \mathcal{H}$ a predictor utilizing both $Z_{inv}$ and $Z_{var}$. If the mutual information between $Z_{inv}$ and $Z$ satisfies $I(Z_{inv}; Z) > c$ for some constant $c > 0$,
then the worst-case OOD risk strictly satisfies:

$$R_{OOD}(h_{inv}) < R_{OOD}(h). \tag{8}$$

The proof is detailed in Appendix G. Theorem 5.6 states that, to ensure a smaller OOD risk, one needs to ensure a high $I(Z_{inv}; Z)$. This means $Z_{inv}$ should contain enough information related to $Z$, thereby ensuring that the invariant predictor also performs well in the training environment.

## 6. CLIP-ICM

Building on the theoretical results in Section 5, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM). CLIP-ICM comprises three stages: (1) collect interventional data; (2) estimate $A_{inv}$ using the interventional data; (3) perform prediction in the invariant subspace. The key challenges lie in two aspects: acquiring interventional data and estimating an $A_{inv}$ that satisfies the conditions in both Theorem 5.6 and Proposition 5.5.

[Figure 3. The pipeline of the proposed CLIP-ICM framework. CLIP-ICM consists of three stages: (1) collect interventional data; (2) estimate $A_{inv}$; (3) perform invariant prediction.]

**Collect interventional data.** According to Condition 5.4, interventional data are those generated by fixing the invariant factors while allowing the variant factors to vary. There are numerous ways to obtain such data (Nikolenko, 2021), including the use of rendering engines (Zhang et al., 2025), generative models (Zhang et al., 2023a; Azizi et al.), or large language models (Li et al., 2023; OpenAI, 2023). Leveraging the property of CLIP to align image and text features within the same embedding space, we propose two approaches for collecting interventional data: one based on images and the other based on text.

We first collect image data from a subset of training environments $\mathcal{E}_{tr} \subset \mathcal{E}_{all}$. The interventional data are generated by applying data augmentation techniques. Data augmentation preserves $Z_{inv}$ while altering $Z_{var}$. This works because invariant factors, like class-relevant features, remain unchanged, while variant factors, such as color, orientation, or background, are modified. We use a combination of methods, including color jittering, grayscale, and Gaussian blur, to create diverse transformations. Details on these techniques are provided in Appendix H, and an ablation study of specific augmentations is discussed in Appendix M.1. For each image $x$, we apply a random augmentation $\alpha(\cdot)$ to generate an augmented sample $\alpha(x)$. The embeddings of the pair $(f_I(\alpha(x)), f_I(x))$ are then used as interventional data.

Due to the ability of CLIP to align image and text embeddings in the same latent space, interventional data can also be generated by modifying the text descriptions associated with a given image. We use the same training images from $\mathcal{E}_{tr}$. Pre-intervention text descriptions are generated for each image using an image captioning model. Post-intervention text descriptions are created by modifying these captions using a large language model. Details of the prompts used for generating and modifying text descriptions are provided in Appendix H.1. Finally, the embeddings of the pre- and post-intervention text descriptions, $f_T(t)$ and $f_T(\beta(t))$, are used as interventional data for further processing.
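As a concrete sketch of the image-based collection route described above, the snippet below builds embedding pairs $(f_I(x), f_I(\alpha(x)))$ from a batch of images. Here `encode_image` is a stand-in for CLIP's image encoder, and the specific augmentation parameters are illustrative rather than the ones used in the paper (those are listed in its Appendix H).

```python
import torch
from torchvision import transforms

# Augmentations that perturb variant factors (color, sharpness, ...)
# while leaving class-relevant content untouched.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
])

def interventional_pairs(images, encode_image):
    """Build (f_I(x), f_I(alpha(x))) embedding pairs from a batch of images.

    images: float tensor of shape (B, 3, H, W) with values in [0, 1]; any
            encoder-specific resizing/normalization is assumed to happen
            inside `encode_image`.
    encode_image: a stand-in callable for CLIP's image encoder f_I.
    """
    with torch.no_grad():
        z = encode_image(images)              # embeddings of the original images
        z_do = encode_image(augment(images))  # embeddings after the intervention alpha(x)
    return z, z_do
```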
**Estimate $A_{inv}$.** Once the interventional data are obtained, the next step is to learn the $A_{inv}$ matrix. The $A_{inv}$ matrix must satisfy two key conditions: (1) according to Proposition 5.5, $A_{inv}$ should satisfy Equation (3), ensuring that the embeddings obtained from interventional data pairs are equal after applying $A_{inv}$; (2) based on Theorem 5.6, the projected embedding should retain as much information about the original representation $f_I(x)$ as possible. To meet these conditions, we formulate the learning objective as the following optimization problem:

$$\min_{A_{inv}} \; \left\| \hat{Z} - A_{inv} A_{inv}^T \hat{Z} \right\|_F^2 + \lambda \left\| A_{inv}^T \left( \hat{Z} - \hat{Z}^{do(z_{inv})} \right) \right\|_F^2, \quad \text{s.t.} \; A_{inv}^T A_{inv} = I_{D_{inv}}. \tag{9}$$

Here, $\hat{Z}$ represents the matrix of observable embeddings obtained from either the image encoder $f_I$ or the text encoder $f_T$. $\hat{Z}^{do(z_{inv})}$ corresponds to the embeddings obtained from interventional data, where the invariant factors $Z_{inv}$ are fixed while the variant factors $Z_{var}$ vary. $I_{D_{inv}}$ denotes the identity matrix of dimension $D_{inv}$, and $\lambda$ is a regularization hyperparameter. The first term of Equation (9), together with the constraint $A_{inv}^T A_{inv} = I_{D_{inv}}$, ensures that $\hat{Z}$ is effectively reconstructed within the invariant subspace defined by $A_{inv}$. The second term enforces consistency between the embeddings of interventional data in the invariant subspace.

**Perform invariant prediction.** After obtaining $A_{inv}$ through Equation (9), we can perform invariant prediction. For a test image $x$ and a fixed set of text prompts $\{t_c\}_{c \in \mathcal{Y}}$, the embeddings are first computed using $f_I$ and $f_T$, respectively. These embeddings are then projected into the invariant subspace using $A_{inv}$, and predictions are made using the invariant predictor defined in Equation (6).⁷

⁷Alternatively, predictions can be made by projecting only the image embeddings through $A_{inv}$ and training a classifier on the projected space. This approach is more suitable when the classes in the training environments are identical to those in the test environments (i.e., no open-class scenarios).
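As an illustration of how Equation (9) and the invariant predictor can be realized, the sketch below adopts the convention $A_{inv}^T A_{inv} = I_{D_{inv}}$, so that $A_{inv}$ has orthonormal columns and the invariant representation of an embedding is $A_{inv}^T \hat{z}$. Under this convention, the objective reduces to a trace maximization solved by the top eigenvectors of $\hat{Z}^T \hat{Z} - \lambda \Delta^T \Delta$ with $\Delta = \hat{Z} - \hat{Z}^{do(z_{inv})}$; the paper does not state which optimizer it uses, so this closed form, like all variable names here, is only an illustrative choice (gradient-based optimization with an orthogonality constraint would serve equally well).

```python
import numpy as np

def estimate_A_inv(Z, Z_do, d_inv, lam=1.0):
    """Estimate the invariant projection from the objective in Eq. (9).

    Z:    (N, D) observable embeddings (rows are f_I(x) or f_T(t)).
    Z_do: (N, D) embeddings of the paired interventional samples
          (e.g. f_I(alpha(x)) for the same rows of Z).
    Returns W of shape (D, d_inv) with orthonormal columns; the invariant
    representation of an embedding z is W.T @ z.
    """
    delta = Z - Z_do                         # differences A_inv should cancel (Eq. 3)
    M = Z.T @ Z - lam * (delta.T @ delta)    # maximize tr(W.T M W) s.t. W.T W = I
    eigval, eigvec = np.linalg.eigh(M)       # eigenvalues in ascending order
    return eigvec[:, -d_inv:]                # top-d_inv eigenvectors

def invariant_predict(img_emb, text_embs, W):
    """Invariant zero-shot prediction (Eq. 6): compare image and text
    embeddings after projecting both into the invariant subspace."""
    zi = W.T @ img_emb
    zt = text_embs @ W                       # (C, d_inv)
    zi = zi / np.linalg.norm(zi)
    zt = zt / np.linalg.norm(zt, axis=1, keepdims=True)
    return int(np.argmax(zt @ zi))
```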
## 7. Experiment

To evaluate the CLIP-ICM framework in OOD scenarios, we conduct experiments on the DomainBed benchmark (Gulrajani & Lopez-Paz). We use five datasets from DomainBed: PACS (Li et al., 2017), VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), Terra Incognita (Beery et al., 2018), and DomainNet (Peng et al., 2019).⁸ We evaluate CLIP-ICM for OOD generalization in two settings: (1) domain shift and (2) domain shift with open-class scenarios. For domain shift, we use a leave-one-out protocol, training on all domains except the target and testing on the target domain (Table 2). For the combined setting, we split the data into base and new classes, train on base classes in the training domains, and evaluate both base and new classes in the target domain (Table 3). Each value in Table 2 and Table 3 represents the mean and standard deviation over 5 runs with different random seeds. We compare three interventional data collection strategies: models labeled (image) use only image-based data, models labeled (text) use only text-based data, and unlabeled models use both. We also compare two prediction strategies: CLIP-ICM, which maps embeddings through $A_{inv}$, and CLIP-ICM Linear-Probe, which trains a classifier on image embeddings projected by $A_{inv}$.

⁸See Appendices I, J and O for additional results and Appendix M for ablation studies.

**Results.** From Table 2, CLIP-ICM consistently outperforms previous fine-tuning methods for CLIP. On the Terra Incognita dataset, CLIP-ICM Linear-Probe surpasses the best-performing baseline, CLIPood, by 6%. From Table 3, while CLIP-ICM does not achieve the best performance on base classes, it significantly outperforms other methods on new classes: the average performance on new classes exceeds that of the best-performing baseline, CoOp, by 5.5%. This shows that CLIP-ICM preserves the zero-shot capability of CLIP even after fine-tuning. Comparing the three approaches to interventional data, models using only image-based data outperform those using only text-based data, and combining both further improves performance, highlighting their complementary benefits.

Table 2. Accuracy on the DomainBed benchmark with domain shift. Methods labeled (image) are trained with only image-based interventional data; methods labeled (text) are trained with only text-based interventional data.

| Method | Backbone | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
|---|---|---|---|---|---|---|---|
| Zero-shot | CLIP | 96.1 | 82.4 | 71.5 | 34.2 | 56.8 | 68.2 |
| Linear-Probe | CLIP | 96.4 ± 0.1 | 78.7 ± 0.2 | 81.9 ± 0.4 | 60.2 ± 0.2 | 55.0 ± 0.4 | 74.4 ± 0.4 |
| MIRO (Cha et al., 2022) | CLIP | 95.6 ± 0.2 | 82.2 ± 0.1 | 82.5 ± 0.3 | 54.3 ± 0.2 | 54.0 ± 0.5 | 73.7 ± 0.5 |
| CoOp (Zhou et al., 2022b) | CLIP | 97.0 ± 0.2 | 83.0 ± 0.1 | 81.1 ± 0.4 | 54.6 ± 0.2 | 59.5 ± 0.2 | 75.0 ± 0.2 |
| CoCoOp (Zhou et al., 2022a) | CLIP | 96.7 ± 0.4 | 83.6 ± 0.1 | 80.7 ± 0.1 | 56.2 ± 0.3 | 59.7 ± 0.4 | 75.4 ± 0.4 |
| CLIP-Adapter (Gao et al., 2023) | CLIP | 96.4 ± 0.3 | 84.3 ± 0.5 | 82.2 ± 0.2 | 57.5 ± 0.4 | 59.9 ± 0.1 | 76.1 ± 0.1 |
| DPL (Zhang et al., 2023b) | CLIP | 97.3 ± 0.5 | 84.3 ± 0.1 | 84.2 ± 0.2 | 52.6 ± 0.4 | 56.7 ± 0.4 | 75.0 ± 0.4 |
| CLIPood | CLIP | 97.3 ± 0.1 | 85.0 ± 0.4 | 87.0 ± 0.2 | 60.4 ± 0.7 | 63.5 ± 0.1 | 78.6 ± 0.1 |
| CLIP-ICM (image) | CLIP | 97.3 ± 0.5 | 84.1 ± 0.4 | 82.6 ± 0.3 | 49.9 ± 0.3 | 60.5 ± 0.3 | 74.9 ± 0.4 |
| CLIP-ICM Linear-Probe (image) | CLIP | 97.5 ± 0.5 | 86.5 ± 0.1 | 84.6 ± 0.4 | 64.3 ± 0.3 | 64.0 ± 0.2 | 79.0 ± 0.3 |
| CLIP-ICM (text) | CLIP | 96.8 ± 0.4 | 83.4 ± 0.3 | 82.1 ± 0.4 | 45.2 ± 0.3 | 57.4 ± 0.2 | 73.0 ± 0.3 |
| CLIP-ICM Linear-Probe (text) | CLIP | 97.2 ± 0.5 | 85.2 ± 0.5 | 82.4 ± 0.3 | 61.2 ± 0.1 | 59.6 ± 0.4 | 77.1 ± 0.4 |
| CLIP-ICM | CLIP | 97.7 ± 0.2 | 86.2 ± 0.3 | 84.6 ± 0.2 | 52.5 ± 0.4 | 61.1 ± 0.3 | 76.4 ± 0.3 |
| CLIP-ICM Linear-Probe | CLIP | 97.8 ± 0.3 | 86.6 ± 0.1 | 87.1 ± 0.4 | 66.5 ± 0.1 | 65.0 ± 0.1 | 80.6 ± 0.2 |

Table 3. Accuracy on OfficeHome and DomainNet with both domain shift and open classes. Methods labeled (image) are trained with only image-based interventional data; methods labeled (text) are trained with only text-based interventional data.
| Split | Method | OfficeHome A | C | P | R | DomainNet C | I | P | Q | R | S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | CLIP | 86.8 | 75.5 | 89.5 | 92.6 | 72.8 | 51.7 | 66.0 | 13.5 | 83.4 | 66.9 |
| Base | CoOp | 87.0 ± 0.4 | 78.3 ± 1.2 | 92.4 ± 0.2 | 91.4 ± 0.6 | 75.7 ± 0.2 | 58.8 ± 0.5 | 68.5 ± 1.3 | 13.1 ± 1.0 | 84.0 ± 0.5 | 70.0 ± 0.1 |
| Base | CLIPood | 90.1 ± 0.2 | 79.7 ± 0.2 | 93.1 ± 0.1 | 94.8 ± 0.1 | 79.0 ± 0.2 | 62.2 ± 0.1 | 73.0 ± 0.2 | 20.2 ± 0.2 | 86.2 ± 0.1 | 73.8 ± 0.1 |
| Base | CLIP-ICM (image) | 88.6 ± 0.4 | 78.0 ± 0.2 | 90.2 ± 0.3 | 93.1 ± 0.2 | 74.4 ± 0.3 | 53.2 ± 0.3 | 67.2 ± 0.3 | 14.6 ± 0.2 | 85.2 ± 0.3 | 67.8 ± 0.2 |
| Base | CLIP-ICM (text) | 87.1 ± 0.1 | 77.2 ± 0.2 | 89.3 ± 0.5 | 92.0 ± 0.2 | 73.1 ± 0.2 | 52.0 ± 0.1 | 66.1 ± 0.1 | 14.1 ± 0.1 | 83.8 ± 0.4 | 66.8 ± 0.4 |
| Base | CLIP-ICM | 89.2 ± 0.4 | 78.6 ± 0.4 | 90.6 ± 0.1 | 93.7 ± 0.2 | 75.0 ± 0.3 | 53.9 ± 0.3 | 67.6 ± 0.4 | 14.9 ± 0.3 | 85.9 ± 0.3 | 68.4 ± 0.4 |
| New | CLIP | 76.6 | 59.4 | 88.1 | 86.2 | 70.2 | 44.1 | 66.4 | 14.1 | 83.5 | 61.0 |
| New | CoOp | 76.5 ± 1.1 | 56.6 ± 2.4 | 88.0 ± 1.9 | 86.8 ± 0.7 | 71.5 ± 0.2 | 47.2 ± 0.3 | 67.3 ± 0.7 | 14.8 ± 0.7 | 83.7 ± 0.7 | 63.1 ± 0.3 |
| New | CLIPood | 77.8 ± 0.2 | 60.0 ± 0.2 | 88.3 ± 0.1 | 86.7 ± 0.1 | 71.2 ± 0.1 | 48.1 ± 0.1 | 68.2 ± 0.2 | 18.0 ± 0.4 | 83.4 ± 0.1 | 62.9 ± 0.1 |
| New | CLIP-ICM (image) | 81.7 ± 0.4 | 66.5 ± 0.5 | 90.2 ± 0.4 | 90.6 ± 0.4 | 76.7 ± 0.2 | 50.9 ± 0.2 | 69.1 ± 0.5 | 17.2 ± 0.5 | 83.6 ± 0.3 | 67.7 ± 0.5 |
| New | CLIP-ICM (text) | 81.2 ± 0.2 | 65.2 ± 0.2 | 89.4 ± 0.4 | 89.9 ± 0.2 | 76.0 ± 0.2 | 49.8 ± 0.1 | 67.9 ± 0.2 | 15.7 ± 0.1 | 82.5 ± 0.1 | 67.0 ± 0.4 |
| New | CLIP-ICM | 82.6 ± 0.1 | 67.5 ± 0.2 | 90.9 ± 0.3 | 91.5 ± 0.3 | 77.8 ± 0.1 | 51.6 ± 0.5 | 70.2 ± 0.2 | 18.0 ± 0.4 | 84.5 ± 0.2 | 68.6 ± 0.5 |
| Total | CLIP | 82.6 | 67.3 | 88.8 | 89.5 | 71.4 | 47.1 | 66.2 | 13.8 | 83.4 | 63.4 |
| Total | CoOp | 82.7 ± 0.5 | 67.2 ± 0.7 | 90.2 ± 1.0 | 89.2 ± 0.6 | 73.4 ± 0.3 | 51.8 ± 0.3 | 67.9 ± 1.0 | 13.7 ± 0.8 | 83.9 ± 0.5 | 66.0 ± 0.2 |
| Total | CLIPood | 85.1 ± 0.1 | 69.6 ± 0.2 | 90.8 ± 0.1 | 91.0 ± 0.1 | 74.8 ± 0.1 | 53.6 ± 0.1 | 70.6 ± 0.1 | 19.1 ± 0.3 | 84.8 ± 0.1 | 67.4 ± 0.1 |
| Total | CLIP-ICM (image) | 85.2 ± 0.4 | 72.3 ± 0.3 | 90.2 ± 0.3 | 91.9 ± 0.3 | 75.6 ± 0.2 | 52.1 ± 0.2 | 68.2 ± 0.4 | 15.9 ± 0.3 | 84.4 ± 0.3 | 67.8 ± 0.3 |
| Total | CLIP-ICM (text) | 84.2 ± 0.1 | 71.2 ± 0.2 | 89.4 ± 0.5 | 91.0 ± 0.2 | 74.6 ± 0.2 | 50.9 ± 0.1 | 67.0 ± 0.1 | 14.9 ± 0.1 | 83.2 ± 0.2 | 66.9 ± 0.4 |
| Total | CLIP-ICM | 85.9 ± 0.2 | 73.1 ± 0.3 | 90.8 ± 0.2 | 92.6 ± 0.2 | 76.4 ± 0.2 | 52.8 ± 0.4 | 68.9 ± 0.3 | 16.5 ± 0.3 | 85.2 ± 0.2 | 68.5 ± 0.5 |

## 8. Conclusion

In our work, we observe the limitations of CLIP in OOD scenarios. Causal analysis reveals that predictors relying solely on invariant factors achieve consistent performance across training and test environments. Identifiability analysis further guarantees the existence of a linear mapping from CLIP embeddings to an invariant subspace, which can be estimated using interventional data. If certain conditions are met, the invariant predictor achieves a lower OOD risk. Building on these insights, we propose CLIP-ICM, a framework for collecting interventional data and estimating the required linear mapping. Experiments on benchmark datasets demonstrate the superior performance of CLIP-ICM compared to existing methods.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments. This work is supported in part by the China Postdoctoral Science Foundation, Grant No. 2024M753356, and in part by the National Natural Science Foundation of China, No. 62406313.

## References

Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the Unseen, Logic Reasoning and Degree Curriculum. In Proceedings of the 40th International Conference on Machine Learning, pp. 31-60. PMLR, July 2023. ISSN: 2640-3498.

Ahuja, K., Shanmugam, K., Varshney, K., and Dhurandhar, A. Invariant Risk Minimization Games. In Proceedings of the 37th International Conference on Machine Learning, pp. 145-155. PMLR, November 2020. ISSN: 2640-3498.

Ahuja, K., Caballero, E., Zhang, D., Gagnon-Audet, J.-C., Bengio, Y., Mitliagkas, I., and Rish, I. Invariance principle meets information bottleneck for out-of-distribution generalization.
Advances in Neural Information Processing Systems, 34:3438 3450, 2021. Ahuja, K., Mahajan, D., Syrgkanis, V., and Mitliagkas, I. Towards efficient representation identification in supervised learning. In Conference on Causal Learning and Reasoning, pp. 19 43. PMLR, 2022. Ahuja, K., Mahajan, D., Wang, Y., and Bengio, Y. Interventional Causal Representation Learning. In Proceedings of the 40th International Conference on Machine Learning, pp. 372 407. PMLR, July 2023. ISSN: 2640-3498. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant Risk Minimization, March 2020. ar Xiv:1907.02893 [cs, stat]. Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., and Fleet, D. J. Synthetic data from diffusion models improves imagenet classification. Transactions on Machine Learning Research. Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pp. 456 473, 2018. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79:151 175, 2010. Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8): 1798 1828, 2013. Bishop, C. M. Latent Variable Models. In Jordan, M. I. (ed.), Learning in Graphical Models, pp. 371 403. Springer Netherlands, Dordrecht, 1998. ISBN 978-94-010-6104-9 978-94-011-5014-9. doi: 10.1007/978-94-011-5014-9_ 13. Cha, J., Lee, K., Park, S., and Chun, S. Domain generalization by mutual-information regularization with pretrained models. In European conference on computer vision, pp. 440 457. Springer, 2022. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, pp. 1597 1607. PMLR, November 2020. ISSN: 2640-3498. Chen, Y., Zhou, K., Bian, Y., Xie, B., Wu, B., Zhang, Y., KAILI, M., Yang, H., Zhao, P., Han, B., et al. Pareto invariant risk minimization: Towards mitigating the optimization dilemma in out-of-distribution generalization. In The Eleventh International Conference on Learning Representations. Chuang, C.-Y., Torralba, A., and Jegelka, S. Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning, pp. 1984 1994. PMLR, 2020. Elhoseiny, M., Saleh, B., and Elgammal, A. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2584 2591, 2013. Fang, C., Xu, Y., and Rockmore, D. N. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1657 1664, 2013. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. Devise: A deep visualsemantic embedding model. Advances in neural information processing systems, 26, 2013. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M., and Lempitsky, V. Domainadversarial training of neural networks. Journal of machine learning research, 17(59):1 35, 2016. Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., and Qiao, Y. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. International Journal of Computer Vision, September 2023. 
ISSN 0920-5691, 1573-1405. doi: 10.1007/s11263-023-01891-x. Glymour, M., Pearl, J., and Jewell, N. P. Causal Inference in Statistics: A Primer. John Wiley & Sons, January 2016. ISBN 978-1-119-18686-1. Google-Books-ID: I0V2Cw AAQBAJ. Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations. Hendrycks et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8340 8349, 2021a. Learning Invariant Causal Mechanism from Vision-Language Models Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15262 15271, 2021b. Hyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in neural information processing systems, 29, 2016. Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4): 411 430, June 2000. ISSN 0893-6080. doi: 10.1016/ S0893-6080(00)00026-5. Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429 439, April 1999. ISSN 0893-6080. doi: 10.1016/S0893-6080(98)00140-3. Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp. 4904 4916. PMLR, 2021. Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., David, E., Stavness, I., Guo, W., Earnshaw, B. A., Haque, I. S., Beery, S., Leskovec, J., Kundaje, A., Pierson, E., Levine, S., Finn, C., and Liang, P. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML), 2021. Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5542 5550, 2017. Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5400 5409, 2018. Li, Z., Zhu, H., Lu, Z., and Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10443 10461, 2023. Li, Z., Cai, R., Chen, G., Sun, B., Hao, Z., and Zhang, K. Subspace identification for multi-source domain adaptation. Advances in Neural Information Processing Systems, 36, 2024. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114 4124. PMLR, 2019. Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pp. 7721 7735. PMLR, 2021. Nikolenko, S. I. Synthetic data for deep learning, volume 174. 
Springer, 2021. Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. Zero-shot learning by convex combination of semantic embeddings. In 2nd International Conference on Learning Representations, ICLR 2014, 2014. Open AI. GPT-4 Technical Report, December 2023. ar Xiv:2303.08774 [cs]. Pearl, J. Causality. Cambridge university press, 2009. Pearl, J. and Bareinboim, E. External Validity: From Do Calculus to Transportability Across Populations. Statistical Science, 29(4):579 595, November 2014. ISSN 0883-4237, 2168-8745. doi: 10.1214/14-STS486. Publisher: Institute of Mathematical Statistics. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1406 1415, 2019. Peters, J., Bühlmann, P., and Meinshausen, N. Causal Inference by using Invariant Prediction: Identification and Confidence Intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947 1012, November 2016. ISSN 1369-7412, 1467-9868. doi: 10.1111/rssb.12167. Pham, H., Dai, Z., Ghiasi, G., Kawaguchi, K., Liu, H., Yu, A. W., Yu, J., Chen, Y.-T., Luong, M.-T., Wu, Y., Tan, M., and Le, Q. V. Combined scaling for zero-shot transfer learning. Neurocomputing, 555:126658, October 2023. ISSN 09252312. doi: 10.1016/j.neucom.2023.126658. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748 8763. PMLR, July 2021. ISSN: 2640-3498. Learning Invariant Causal Mechanism from Vision-Language Models Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pp. 5389 5400. PMLR, 2019. Robey, A., Pappas, G. J., and Hassani, H. Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210 20229, 2021. Roeder, G., Metz, L., and Kingma, D. On linear identifiability of learned representations. In International Conference on Machine Learning, pp. 9030 9039. PMLR, 2021. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations. Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward Causal Representation Learning. Proceedings of the IEEE, 109(5):612 634, May 2021. ISSN 1558-2256. doi: 10.1109/JPROC.2021.3058954. Conference Name: Proceedings of the IEEE. Shu, Y., Guo, X., Wu, J., Wang, X., Wang, J., and Long, M. Clipood: Generalizing clip to out-of-distributions. In International Conference on Machine Learning, pp. 31716 31731. PMLR, 2023. Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Zeroshot learning through cross-modal transfer. Advances in neural information processing systems, 26, 2013. Squires, C., Seigal, A., Bhate, S. S., and Uhler, C. Linear causal disentanglement via interventions. In International Conference on Machine Learning, pp. 32540 32560. PMLR, 2023. Suter, R., Miladinovic, D., Schölkopf, B., and Bauer, S. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pp. 6056 6065. PMLR, 2019. 
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583 18599, 2020. Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5018 5027, 2017. Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Selfsupervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451 16467, 2021. Wang, H., Ge, S., Lipton, Z., and Xing, E. P. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019. Yan, S., Song, H., Li, N., Zou, L., and Ren, L. Improve unsupervised domain adaptation with mixup training. ar Xiv preprint ar Xiv:2001.00677, 2020. Yang, M., Fang, Z., Zhang, Y., Du, Y., Liu, F., Ton, J.-F., Wang, J., and Wang, J. Invariant learning via probability of sufficient and necessary causes. Advances in Neural Information Processing Systems, 36:79832 79857, 2023. Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836 3847, 2023a. Zhang, L., Rao, A., and Agrawala, M. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In The Thirteenth International Conference on Learning Representations, 2025. Zhang, M., Marklund, H., Dhawan, N., Gupta, A., Levine, S., and Finn, C. Adaptive risk minimization: Learning to adapt to domain shift. Advances in Neural Information Processing Systems, 34:23664 23678, 2021. Zhang, X., Gu, S. S., Matsuo, Y., and Iwasawa, Y. Domain prompt learning for efficiently adapting clip to unseen domains. Transactions of the Japanese Society for Artificial Intelligence, 38(6):B MC2_1, 2023b. Zhang, Y., Li, J., Liu, L., and Qiang, W. Rethinking misalignment in vision-language model adaptation from a causal perspective. ar Xiv preprint ar Xiv:2410.12816, 2024. Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16816 16825, 2022a. Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9):2337 2348, September 2022b. ISSN 0920-5691, 1573-1405. doi: 10.1007/s11263-022-01653-1. ar Xiv:2109.01134 [cs]. Learning Invariant Causal Mechanism from Vision-Language Models Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International conference on machine learning, pp. 12979 12990. PMLR, 2021. Learning Invariant Causal Mechanism from Vision-Language Models This supplementary material provides detailed proofs for the theorem and proposition mentioned in the main text. Furthermore, additional experimental results and implementation details are provided. Appendix A provides the definition of all notations of the main text. Appendix B presents a detailed discussion of various fine-tuning strategies within the context established in Section 5.1. Appendix C provides some background of the causality. 
Appendix D provides the proof of Proposition 5.1 in the main text. Appendix E provides the proof of Proposition 5.3 in the main text. Appendix F provides the proof of Proposition 5.5 in the main text. Appendix G provides the proof of Theorem 5.6 in the main text. Appendix H provides the implementation details for CLIP-ICM. Appendix I provides the performance of CLIP-ICM in the setting of Section 4. Appendix J provides the experimental results on ImageNet and variants of ImageNet. Appendix K provides a detailed description of the used datasets. Appendix L provides a detailed comparison with existing works. Appendix M provides the ablation study. Appendix N provides an analysis where the domain prompt is used in the zero-shot case. Appendix O provides the full results in each domain of the DomainNet benchmark.

A. List of Notations

We list the definitions of all notations from the main text as follows:

$x$: An image sample.
$t$: A text description.
$e$: An environment.
$e^*$: The test environment.
$c$: A class.
$y$: A ground-truth label.
$\hat{y}$: A predicted label.
$\hat{z}$: The image embedding.
$z$: A latent factor.
$z_{inv}$: An invariant latent factor.
$z_{var}$: A variant latent factor.
$w$: The normal vector of the decision boundary.

Random Variables
$X$: The random variable of the image.
$Y$: The random variable of the class label.
$Z_{inv}$: The random variable of the invariant factors.
$Z_{var}$: The random variable of the variant factors.
$Z^*_{var}$: The random variable of the variant factors in environment $e^*$.
$E$: The selection variable of environments.

Sample Spaces / Supports
$\mathcal{T}$: The support of text descriptions.
$\mathcal{X}$: The support of image samples.
$\mathcal{X}^{do(z_{inv})}$: The support of image samples when the latent factor $z_{inv}$ takes a fixed value.
$\mathcal{Y}$: The support of the class label.
$\hat{\mathcal{Z}}$: The support of the embedding.
$\mathcal{Z}$: The support of the latent factors.
$\mathcal{Z}_{inv}$: The invariant subspace.
$\mathcal{Z}_{var}$: The variant subspace.

$f_I$: The image encoder.
$f_T$: The text encoder.
$f_I^*$: The ideal image encoder.
$f_T^*$: The ideal text encoder.
$g$: The ideal data-generating process.
$f$: The ideal representation-learning process.
$h$: The regular predictor.
$h_{inv}$: The invariant predictor.
$\mathbb{I}$: The indicator function.
$W$: The coefficient matrix of a linear classifier.
$P$: The probability distribution.
$P^*$: The probability distribution in environment $e^*$.
$P^{do(z_{inv})}$: The distribution of the other latent factors when $z_{inv}$ takes a fixed value.
$A_{inv}$: The mapping from $\hat{\mathcal{Z}}$ to $\mathcal{Z}_{inv}$.
$A_{var}$: The mapping from $\hat{\mathcal{Z}}$ to $\mathcal{Z}_{var}$.
$\alpha$: The data augmentation function.
$\beta$: The function to alter a text description.
$\mathcal{E}_{tr}$: The set of all training environments.
$\mathcal{E}_{all}$: The set of all environments.
$\mathcal{D}$: The datasets from training environments.
$\mathcal{D}^e$: The dataset in environment $e$.
$\mathcal{H}$: The hypothesis class of predictors.
$\hat{Z}$: The collection of embedding vectors.
$\hat{Z}^{do(z_{inv})}$: The collection of embeddings obtained from interventional data, where the invariant factors $Z_{inv}$ are fixed while the variant factors $Z_{var}$ vary.

Functionals
$R^{OOD}$: The OOD risk.
$R^e$: The generalization risk in environment $e$.
$R^{tr}$: The generalization risk on $\mathcal{E}_{tr}$.
$R^*$: The generalization risk in environment $e^*$.

$D$: The dimension of the embedding space / latent factors.
$D_{inv}$: The dimension of the invariant factors.
$D_{var}$: The dimension of the variant factors.
$N^e$: The number of instances in environment $e$.
$C$: The total number of classes.
$K$: The total number of data augmentations.
B. Discussion on Fine-tune Strategies

Figure 4. An illustration of the fine-tuning process in the setting of Section 5.1, with panels (a) Before Fine-tune, (b) Linear Probe, (c) Fine-tune $f_I$, (d) Adapter, and (e) Learnable Prompt. The red circle represents samples from the base class, while the green square represents samples from the new class.

In this section, we discuss why almost all fine-tuning methods can be understood as leading to the scenario illustrated in Figure 2. This is detailed in Figure 4. Before fine-tuning on the training domains, the image embeddings and text embeddings are not well aligned, as shown in Figure 4(a). However, there exist class-specific text embeddings $f_T(t_1), f_T(t_2)$ for both base and new classes.

When fine-tuning with a linear probe, the method learns a classifier specifically for the base classes, with parameters $w_1, b_1$. Within the base classes, the classifier parameters $w_1$ align with the image embeddings, as illustrated in Figure 4(b). Thus, it satisfies the conditions depicted in Figure 2(a). However, linear-probe fine-tuning does not utilize information from CLIP's text encoder, making it unable to predict new classes.

When only the image encoder is fine-tuned, the distribution of image embeddings changes, while the text embeddings remain fixed. This adjustment aligns the image embeddings for base classes in the training domains with their corresponding text embeddings. For new classes, the image embedding distribution also shifts, and ideally, they should align with the fixed text embeddings, as shown in Figure 4(c). This satisfies the conditions described in Figure 2(a).

When using an adapter for fine-tuning on the training domains, both image and text embeddings are altered. This ensures that the image and text embeddings for base classes are aligned. Ideally, by applying the same adapter to new classes, the text and image embeddings of new classes should also align in the training environment, as depicted in Figure 4(d).

When using learnable prompts for fine-tuning, the image embeddings remain unchanged, but the introduction of learnable tokens modifies the text embeddings. This allows the text embeddings of base classes to align with the image embeddings in the training domains. Ideally, by applying the same learnable tokens to new classes, the text embeddings of new classes adjust accordingly, aligning with the image embeddings in the training environment, as shown in Figure 4(e).
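To make the distinction between these strategies concrete, the following is a minimal, hypothetical sketch of two of them, a linear probe (Figure 4(b)) and an adapter (Figure 4(d)), built on top of frozen CLIP embeddings. It is not the implementation used in our experiments; the feature dimension D and the residual weight alpha are illustrative assumptions.

```python
# Sketch only: two of the fine-tuning strategies discussed above, on top of
# frozen CLIP features. Only the small added modules are trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # assumed CLIP embedding dimension (e.g., ViT-B/16)

class LinearProbe(nn.Module):
    """Figure 4(b): a classifier trained on frozen image embeddings only.
    It ignores the text encoder, so it cannot score unseen (new) classes."""
    def __init__(self, num_base_classes: int):
        super().__init__()
        self.head = nn.Linear(D, num_base_classes)  # parameters (w, b)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.head(image_features)

class Adapter(nn.Module):
    """Figure 4(d): residual adapters on both modalities; class scores are
    cosine similarities between adapted image and text embeddings, so new
    classes can still be scored through their text descriptions."""
    def __init__(self, hidden: int = 256, alpha: float = 0.2):
        super().__init__()
        self.image_adapter = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(), nn.Linear(hidden, D))
        self.text_adapter = nn.Sequential(nn.Linear(D, hidden), nn.ReLU(), nn.Linear(hidden, D))
        self.alpha = alpha  # residual mixing weight (an assumption of this sketch)

    def forward(self, image_features, text_features):
        img = F.normalize(image_features + self.alpha * self.image_adapter(image_features), dim=-1)
        txt = F.normalize(text_features + self.alpha * self.text_adapter(text_features), dim=-1)
        return img @ txt.t()  # cosine-similarity logits over the class prompts
```

In both cases the CLIP backbone stays frozen, which is why the alignment argument above hinges on how the frozen embeddings shift when the domain changes.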
It is important to note that these fine-tuning methods utilize the same text description embeddings or linear classifiers during testing across environments. However, in the test environments, the distribution of image data shifts, potentially leading to the situation illustrated in Figure 2(b), where incorrect predictions occur for both base and new classes. Worse still, when fine-tuning is performed only on the base classes, the learnable parameters may overfit to the base classes. This overfitting can result in the misalignment of embeddings in Figure 4(c), Figure 4(d), and Figure 4(e) even within the training environment for new classes. Consequently, the misalignment is further exacerbated in the test environments, making it even more challenging to correctly classify new classes.

C. Background in Causality

Structural Causal Model. SCMs can be used to describe causal variables and their relationships. An SCM consists of the following components:

Definition C.1. (Structural Causal Model (Pearl, 2009))
1. A set $U$ of background or exogenous variables, representing factors outside the model, which nevertheless affect relationships within the model.
2. A set $V = \{V_1, \ldots, V_n\}$ of endogenous variables, assumed to be observable. Each of these variables is functionally dependent on some subset $PA_i$ of $U \cup V$. Here, $PA_i$ denotes the set of parent nodes of the endogenous variable $V_i$.
3. A set $F$ of functions $\{f_1, \ldots, f_n\}$ such that each $f_i$ determines the value $v_i$ of $V_i \in V$, $v_i = f_i(pa_i, u)$. These functions are also known as causal mechanisms.
4. A joint probability distribution $P(u)$ over $U$.

SCMs can be represented by Directed Acyclic Graphs (DAGs) (Glymour et al., 2016). In a DAG, the arrow $X \to Y$ denotes that changes in the value of $X$ directly cause changes in $Y$. This relationship does not hold in the reverse order.

The Selection Diagram and Transportability. Pearl and Bareinboim (Pearl & Bareinboim, 2014) propose the selection diagram, which is used to model the causal relations between different environments with an external selection variable. The definition of the selection diagram is:

Definition C.2. (Selection diagram) Let $\langle M, M^* \rangle$ be a pair of SCMs relative to domains $\langle \Pi, \Pi^* \rangle$, sharing a causal diagram $G$. $\langle M, M^* \rangle$ is said to induce a selection diagram $D$ if $D$ is constructed as follows:
1. Every edge in $G$ is also an edge in $D$.
2. $D$ contains an extra edge $S_i \to V_i$ whenever there might exist a discrepancy $f_i \neq f_i^*$ or $P(u_i) \neq P^*(u_i)$ between $M$ and $M^*$.
The variables in the set $S$ are denoted as selection variables.

Transportability studies whether a causal relation estimated in a specific environment can be applied to another environment, and is defined as:

Definition C.3. (Transportability) Let $D$ be a selection diagram relative to domains $\langle \Pi, \Pi^* \rangle$. Let $\langle P, I \rangle$ be the pair of observational and interventional distributions of $\Pi$, and $P^*$ be the observational distribution of $\Pi^*$. The causal relation $R(\Pi^*) = P^*(y \mid do(x))$ is said to be transportable from $\Pi$ to $\Pi^*$ in $D$ if $R(\Pi^*)$ is uniquely computable from $P, P^*, I$ in any model that induces $D$.

D. Proof for Proposition 5.1

Before we proceed to the proof of Proposition 5.1, we first provide some useful definitions and a lemma.

Definition D.1. (d-separation) A set $S$ of nodes is said to block a path $p$ if either
1. $p$ contains at least one arrow-emitting node that is in $S$, or
2. $p$ contains at least one collision node that is outside $S$ and has no descendant in $S$.
If $S$ blocks all paths from set $X$ to set $Y$, it is said to d-separate $X$ and $Y$; $X$ and $Y$ are then independent given $S$, written $X \perp Y \mid S$. d-separation reflects the conditional independencies that hold in any distribution $P$ that is compatible with the DAG.

Definition D.2. (Rules of do-calculus) Let $X, Y, Z, W$ be arbitrary disjoint sets of nodes in a causal DAG $G$. We denote by $G_{\overline{X}}$ the graph obtained by deleting from $G$ all arrows pointing to nodes in $X$. Likewise, we denote by $G_{\underline{X}}$ the graph obtained by deleting from $G$ all arrows emerging from nodes in $X$. The notation $G_{\overline{X}\underline{Z}}$ represents the deletion of both the incoming arrows to $X$ and the outgoing arrows from $Z$.
1. Insertion/deletion of observations: $P(y \mid do(x), z, w) = P(y \mid do(x), w)$ if $(Y \perp Z \mid X, W)_{G_{\overline{X}}}$.
2. Action/observation exchange: $P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)$ if $(Y \perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$.
3. Insertion/deletion of actions: $P(y \mid do(x), do(z), w) = P(y \mid do(x), w)$ if $(Y \perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}$, where $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$.

Definition D.3. (Trivial transportability) A causal relation $R$ is said to be trivially transportable from $\Pi$ to $\Pi^*$ if $R(\Pi^*)$ is identifiable from $(G^*, P^*)$. $R$ is trivially transportable if we can directly estimate $R(\Pi^*)$ from observational data of $\Pi^*$, unaided by the causal information from $\Pi$.

The following lemma states sufficient conditions for the transportability of the average causal effect $P^*(y \mid do(x))$.

Definition D.4. (S-admissibility) A set $Z$ of variables satisfying $(Y \perp S \mid Z, X)$ in $D_{\overline{X}}$ will be called S-admissible with respect to the causal effect of $X$ on $Y$, where $D_{\overline{X}}$ denotes the diagram obtained by deleting all arrows pointing to $X$ in $D$.

Lemma D.5. (Sufficient conditions for transportability (Pearl & Bareinboim, 2014)) The average causal effect $P^*(y \mid do(x))$ is transportable from $\Pi$ to $\Pi^*$ if any one of the following conditions holds:
1. $P^*(y \mid do(x))$ is trivially transportable.
2. There exists a set of covariates $Z$ (possibly affected by $X$) such that $Z$ is S-admissible and for which $P^*(z \mid do(x))$ is transportable.
3. There exists a set of covariates $W$ that satisfies $(X \perp Y \mid W)_{D_{\overline{X(W)}}}$ and for which $P^*(w \mid do(x))$ is transportable.

Proof.
1. According to Definition D.3, the causal relation can be directly estimated from observational data of $\Pi^*$.
2. If condition 2 holds, it implies:
$$P^*(y \mid do(x)) = P(y \mid do(x), s) = \sum_z P(y \mid do(x), z, s)\, P(z \mid do(x), s) = \sum_z P(y \mid do(x), z)\, P^*(z \mid do(x)). \quad (10)$$
The transportability of $P(z \mid do(x))$ reduces $P^*(z \mid do(x))$ to a star-free expression, and therefore $P^*(y \mid do(x))$ is transportable.
3. If condition 3 holds, it implies:
$$P^*(y \mid do(x)) = P(y \mid do(x), s) = \sum_w P(y \mid do(x), w, s)\, P(w \mid do(x), s) = \sum_w P(y \mid w, s)\, P^*(w \mid do(x)). \quad (11)$$
According to Rule 3 of Definition D.2, the transportability of $P^*(w \mid do(x))$ renders $P(w \mid do(x), s)$ star-free and renders $P^*(y \mid do(x))$ transportable.
This ends the proof.

We are now prepared to present the proof for Proposition 5.1 in the main text:

Proposition D.6. Let $P^*(\cdot) := P(\cdot \mid e^*)$ denote the distribution in the test environment $E = e^*$. The causal mechanisms are formalized through interventional distributions. We focus on the predictions $P(y \mid do(\cdot))$, where $do(\cdot)$ represents an intervention. The causal mechanism satisfies $P^*(y \mid do(x)) \neq P(y \mid do(x))$. When using only $Z_{inv}$, the causal mechanism remains unchanged: $P^*(y \mid do(z_{inv})) = P(y \mid do(z_{inv}))$.

Proof.
The causal mechanism $P(y \mid do(x))$ can be decomposed as:
$$P(y \mid do(x)) = \sum_{z_{var}} \sum_{z_{inv}} P(y \mid do(x), z_{var}, z_{inv})\, P(z_{var} \mid do(x))\, P(z_{inv} \mid do(x)).$$
Similarly, the causal mechanism $P^*(y \mid do(x))$ can be decomposed as:
$$P^*(y \mid do(x)) = P(y \mid do(x), e^*) = \sum_{z_{var}} \sum_{z_{inv}} P(y \mid do(x), z_{var}, z_{inv}, e^*)\, P(z_{var} \mid do(x), e^*)\, P(z_{inv} \mid do(x), e^*). \quad (12)$$
From Rule 1 of Definition D.2, we have
$$P(y \mid do(x), z_{var}, z_{inv}, e^*) = P(y \mid do(x), z_{var}, z_{inv}), \quad (13)$$
because $Z_{var}$ satisfies $(Y \perp E \mid Z_{var}, X)$ in $D_{\overline{X}}$, and according to Definition D.4, $Z_{var}$ is E-admissible with respect to the causal effect of $X$ on $Y$. Then, since $(Y \perp X \mid Z_{var}, Z_{inv})$, the following holds:
$$P(y \mid do(x), z_{var}, z_{inv}) = P(y \mid z_{var}, z_{inv}). \quad (14)$$
According to Definition D.1, on the path $Z_{inv} \to Y \leftarrow Z_{var} \leftarrow E$ the node $Y$ is a collision node, so
$$P(z_{inv} \mid do(x), e^*) = P(z_{inv} \mid do(x)). \quad (15)$$
And since there is no other path from $X$ to $Z_{inv}$ or from $X$ to $Z_{var}$, the do-operator is trivial. Therefore, Equation (12) can be rewritten as:
$$P^*(y \mid do(x)) = \sum_{z_{var}} \sum_{z_{inv}} P(y \mid z_{var}, z_{inv})\, P(z_{inv} \mid x)\, P^*(z_{var} \mid x). \quad (16)$$
Since $P^*(z_{var} \mid x) \neq P(z_{var} \mid x)$ in general, $P^*(y \mid do(x)) \neq P(y \mid do(x))$.

If the prediction is based solely on $Z_{inv}$, then $Y$ is independent of $Z_{var}$. The prediction process is now $P^*(y \mid do(z_{inv})) = P(y \mid do(z_{inv}), e^*)$. Since $do(\cdot)$ cuts off all edges entering $Z_{inv}$, $Z_{inv}$ is independent of $E$. Therefore, $P^*(y \mid do(z_{inv})) = P(y \mid do(z_{inv}))$. This ends the proof.

E. Proof for Proposition 5.3

Building upon prior works (Roeder et al., 2021; Ahuja et al., 2022; Hyvarinen & Morioka, 2016; Zimmermann et al., 2021), we offer a proof demonstrating that CLIP identifies the latent factors up to an invertible linear transformation, and we also reuse some of their proof techniques.

Proposition E.1. For an image encoder $f_I: \mathcal{X} \to \hat{\mathcal{Z}}$ and its corresponding text encoder $f_T: \mathcal{T} \to \hat{\mathcal{Z}}$, where $\hat{\mathcal{Z}} \subseteq \mathbb{R}^D$ is the embedding space, if Condition 5.2 is satisfied, then $f_I(x)$ identifies $Z$ up to an invertible linear transformation $A$, i.e., $f_I(x) = A g^{-1}(x) = Az$.

Proof. Consider a training dataset $\mathcal{D} = \{(x_i, t_i)\}_{i=1}^N$ sampled from the joint distribution $p(x, t)$. Let $\mathcal{T}$ denote the set of all possible values of $t$. Let $\theta$ denote the parameters of $f_I$ and $f_T$, and let $\theta^*$ denote the parameters of $f_I^*$ and $f_T^*$ (to which we have no access). The ground-truth conditional probability can be regarded as produced by $f_I^*$ and $f_T^*$:
$$p_{\theta^*}(t \mid x, \mathcal{T}) = \frac{\exp(f_I^*(x)^\top f_T^*(t))}{\sum_{t' \in \mathcal{T}} \exp(f_I^*(x)^\top f_T^*(t'))}. \quad (17)$$
Similarly, the CLIP model functions $f_I$ and $f_T$ produce the distribution:
$$p_\theta(t \mid x, \mathcal{T}) = \frac{\exp(f_I(x)^\top f_T(t))}{\sum_{t' \in \mathcal{T}} \exp(f_I(x)^\top f_T(t'))}. \quad (18)$$
The training objective of CLIP is to minimize the KL divergence $\mathrm{KL}\big(p_\theta(t \mid x, \mathcal{T}) \,\|\, p_{\theta^*}(t \mid x, \mathcal{T})\big)$. Ideally, after training, we have $p_\theta(t \mid x, \mathcal{T}) = p_{\theta^*}(t \mid x, \mathcal{T})$, that is:
$$\frac{\exp(f_I(x)^\top f_T(t))}{\sum_{t' \in \mathcal{T}} \exp(f_I(x)^\top f_T(t'))} = \frac{\exp(f_I^*(x)^\top f_T^*(t))}{\sum_{t' \in \mathcal{T}} \exp(f_I^*(x)^\top f_T^*(t'))}. \quad (19)$$
The above equality illustrates the consistency aspect of Condition 5.2. Building on this, for any pair $t_a$ and $t_b$ the following ratio must hold:
$$\frac{p_\theta(t_a \mid x, \mathcal{T})}{p_\theta(t_b \mid x, \mathcal{T})} = \frac{p_{\theta^*}(t_a \mid x, \mathcal{T})}{p_{\theta^*}(t_b \mid x, \mathcal{T})}. \quad (20)$$
This implies that under Condition 5.2 there exist $D + 1$ distinct text description pairs $(t_a, t_b)$ satisfying:
$$\frac{\exp(f_I(x)^\top f_T(t_a))}{\exp(f_I(x)^\top f_T(t_b))} = \frac{\exp(f_I^*(x)^\top f_T^*(t_a))}{\exp(f_I^*(x)^\top f_T^*(t_b))}. \quad (21)$$
Taking the logarithm of both sides, this simplifies to:
$$(f_T(t_a) - f_T(t_b))^\top f_I(x) = (f_T^*(t_a) - f_T^*(t_b))^\top f_I^*(x). \quad (22)$$
On the left-hand side of Equation (22), $(f_T(t_a) - f_T(t_b))$ can be treated as a basis vector. Since there exist $D + 1$ pairs of $(t_a, t_b)$, at least $D$ linearly independent vectors $(f_T(t_a) - f_T(t_b))$ can form an invertible matrix $L \in \mathbb{R}^{D \times D}$.
The same holds for the right-hand side, where the vectors $(f_T^*(t_a) - f_T^*(t_b))$ form another invertible matrix $L^*$. Substituting these into the equation gives the following system of $D$ linear equations:
$$f_I(x) = (L^* L^{-1})^\top f_I^*(x). \quad (23)$$
Defining $A = (L^* L^{-1})^\top$, which is invertible, we arrive at:
$$f_I(x) = A f_I^*(x) = Az. \quad (24)$$
This completes the proof.

F. Proof for Proposition 5.5

The proof technique of Proposition 5.5 follows the approach of Ahuja et al. (2023), which we adapt to our setting.

Proposition F.1. Suppose the encoder $f_I$ satisfies Proposition 5.3, Condition 5.4 holds, and sampling from $P^{do(z_{inv})}(x)$ is possible. Then for any pair $(x_1^{do(z_{inv})}, x_2^{do(z_{inv})})$ sampled from $P^{do(z_{inv})}(x)$, the mapping $A_{inv}$ satisfies:
$$A_{inv}\big(f_I(x_1^{do(z_{inv})}) - f_I(x_2^{do(z_{inv})})\big) = 0. \quad (25)$$

Proof. Condition 5.4 states that sampling from $P^{do(z_{inv})}(x)$ is available. The invariant factors of the images from $P^{do(z_{inv})}(x)$ are fixed at a specific value, i.e., $z_{inv} = z^*$. For a sample $x_1^{do(z_{inv})}$, its ground-truth latent representation $z_1^{do(z_{inv})} = g^{-1}(x_1^{do(z_{inv})})$ can be written as $z_1^{do(z_{inv})} = [z^*, z_{var,1}]^\top$. Similarly, for another sample $x_2^{do(z_{inv})}$, we have $z_2^{do(z_{inv})} = [z^*, z_{var,2}]^\top$. From Proposition 5.3, the following relationships hold:
$$f_I(x_1^{do(z_{inv})}) = A z_1^{do(z_{inv})}, \qquad A^{-1} f_I(x_1^{do(z_{inv})}) = z_1^{do(z_{inv})}, \qquad \begin{bmatrix} A_{inv} \\ A_{var} \end{bmatrix} f_I(x_1^{do(z_{inv})}) = \begin{bmatrix} z^* \\ z_{var,1} \end{bmatrix}. \quad (26)$$
The same holds for $x_2^{do(z_{inv})}$. Focusing only on $A_{inv}$, we have:
$$A_{inv} f_I(x_1^{do(z_{inv})}) = z^*, \quad (27)$$
and
$$A_{inv} f_I(x_2^{do(z_{inv})}) = z^*. \quad (28)$$
Subtracting these two equations yields:
$$A_{inv}\big(f_I(x_1^{do(z_{inv})}) - f_I(x_2^{do(z_{inv})})\big) = 0. \quad (29)$$
This ends the proof.

G. Proof for Theorem 5.6

Before we proceed to the proof of Theorem 5.6, we give the following definitions. Let $e_{tr} \in \mathcal{E}_{all}$ denote the training environment. Define the distribution on some test environment $e_{te} \in \mathcal{E}_{all}$, $e_{tr} \neq e_{te}$, as $P_{te}$. Let $R^e(h, h') = \mathbb{E}_{z \sim P^e(z)}[\ell(h(z), h'(z))]$ be the expected disagreement between two hypotheses $h, h' \in \mathcal{H}$, where $\ell$ is some loss function (e.g., the cross-entropy loss). This represents a measure of how much two hypotheses disagree with each other on the training distribution. We use the $\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., 2010; Chuang et al., 2020) to measure whether there is any pair of hypotheses whose risk differs significantly between $P_{tr}$ and $P_{te}$.

Definition G.1. ($\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., 2010)) Given two distributions $P_{tr}$ and $P_{te}$ and a hypothesis class $\mathcal{H}$, the $\mathcal{H}\Delta\mathcal{H}$-divergence between $P_{tr}$ and $P_{te}$ is:
$$d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) := \sup_{h, h' \in \mathcal{H}} \big| R^{e_{tr}}(h, h') - R^{e_{te}}(h, h') \big|. \quad (30)$$

Next, we show how the risk in $e_{tr}$ can be related to a test environment $e_{te} \in \mathcal{E}_{all}$ through the $\mathcal{H}\Delta\mathcal{H}$-divergence.

Lemma G.2. (Ben-David et al., 2010) For every hypothesis $h \in \mathcal{H}$, the risk on $P_{te}$ is bounded as:
$$R^{e_{te}}(h) \leq R^{e_{tr}}(h) + d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) + \lambda_{\mathcal{H}}, \quad (31)$$
where $\lambda_{\mathcal{H}}$ is the best joint risk:
$$\lambda_{\mathcal{H}} := \inf_{h' \in \mathcal{H}} \big[ R^{e_{tr}}(h') + R^{e_{te}}(h') \big] / 2. \quad (32)$$

Proof. By the definition of $d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te})$,
$$d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) = \sup_{h, h' \in \mathcal{H}} \big| R^{e_{tr}}(h, h') - R^{e_{te}}(h, h') \big|. \quad (33)$$
Also, with the triangle inequality for classification error:
$$
\begin{aligned}
R^{e_{te}}(h) &\leq R^{e_{te}}(h') + R^{e_{te}}(h, h') \\
&\leq R^{e_{te}}(h') + R^{e_{tr}}(h, h') + \big| R^{e_{tr}}(h, h') - R^{e_{te}}(h, h') \big| \\
&\leq R^{e_{te}}(h') + R^{e_{tr}}(h, h') + d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) \\
&\leq R^{e_{te}}(h') + R^{e_{tr}}(h) + R^{e_{tr}}(h') + d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) \\
&\leq R^{e_{tr}}(h) + d_{\mathcal{H}\Delta\mathcal{H}}(P_{tr}, P_{te}) + \lambda_{\mathcal{H}}. \quad (34)
\end{aligned}
$$
This ends the proof.

Next, we formalize some basic assumptions:

Assumption G.3.
1. $R^e(h_{inv})$ is constant across $e \in \mathcal{E}_{all}$.
2. The Bayes-optimal predictors $h^* \in \mathcal{H}$ and $h^*_{inv} \in \mathcal{H}$ exist.
3. The loss $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is $L$-Lipschitz and bounded by $M > 0$.
Now we can prove Theorem 5.6 from the main text.

Theorem G.4. Let $\mathcal{H}$ be a hypothesis class over $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{E}_{all}$ denote a set of environments with distributions $\{P^e\}_{e \in \mathcal{E}_{all}}$ over $\mathcal{X} \times \mathcal{Y}$. Let $h_{inv} \in \mathcal{H}$ be a predictor relying solely on $Z_{inv}$, and $h \in \mathcal{H}$ a predictor utilizing both $Z_{inv}$ and $Z_{var}$. If the mutual information between $Z_{inv}$ and $Z$ satisfies
$$I(Z_{inv}; Z) > c \quad \text{for some constant } c > 0, \quad (35)$$
then the worst-case OOD risk strictly satisfies
$$R^{OOD}(h_{inv}) < R^{OOD}(h). \quad (36)$$

Proof.
Step 1: Controlling the in-environment risk via $I(Z_{inv}; Z)$. The mutual-information condition $I(Z_{inv}; Z) > c$ guarantees that $Z_{inv}$ captures most of the predictive information in $Z$. By the data processing inequality,
$$I(Y; Z_{inv}) \geq I(Y; Z) - \epsilon(c), \quad (37)$$
where $\epsilon(c) \to 0$ as $c \to \infty$. This implies that the conditional entropy gap between $h^*_{inv}$ and $h^*$ is bounded:
$$H(Y \mid Z_{inv}) - H(Y \mid Z) = I(Y; Z_{var} \mid Z_{inv}) \leq \epsilon(c).$$
For the $L$-Lipschitz loss, the risk gap becomes:
$$R^{e_{tr}}(h^*_{inv}) \leq R^{e_{tr}}(h^*) + L\,\epsilon(c). \quad (38)$$
By realizability, $h_{inv}$ approximates $h^*_{inv}$, so
$$R^{e_{tr}}(h_{inv}) \leq R^{e_{tr}}(h^*_{inv}) + \eta \leq R^{e_{tr}}(h^*) + L\,\epsilon(c) + \eta \quad (39)$$
for some small $\eta > 0$.

Step 2: Bounding the OOD risk via the $\mathcal{H}\Delta\mathcal{H}$-divergence. By Definition G.1, the $\mathcal{H}\Delta\mathcal{H}$-divergence between $P^e$ and $P^{e'}$ is
$$d_{\mathcal{H}\Delta\mathcal{H}}(P^e, P^{e'}) = \sup_{h, h' \in \mathcal{H}} \big| R^{e}(h, h') - R^{e'}(h, h') \big|. \quad (40)$$
For $h_{inv}$, invariance ensures minimal divergence:
$$d_{\mathcal{H}\Delta\mathcal{H}}(P^e, P^{e'}; h_{inv}) \leq \delta_{inv}, \quad (41)$$
while for $h$, the divergence scales with the $Z_{var}$-shift:
$$d_{\mathcal{H}\Delta\mathcal{H}}(P^e, P^{e'}; h) \geq \delta_{var} \gg \delta_{inv}. \quad (42)$$

Step 3: Worst-case risk comparison. By definition, $R^{OOD}(h) = \max\{R^{e}(h), R^{e_{te}}(h)\}$. From the bounds in Steps 1 and 2:
$$R^{e_{te}}(h) \geq R^{e}(h) + \delta_{var} + \lambda_{\mathcal{H}} \geq R^{OOD}(h_{inv}) - L\,\epsilon(c) - \eta + \delta_{var} + \lambda_{\mathcal{H}}. \quad (43)$$
Since the risk of $h_{inv}$ is stable ($R^{OOD}(h_{inv}) = R^{e}(h_{inv})$), we have
$$R^{OOD}(h) \geq R^{e_{te}}(h) \geq R^{OOD}(h_{inv}) - L\,\epsilon(c) - \eta + \delta_{var} + \lambda_{\mathcal{H}}. \quad (44)$$
Under $I(Z_{inv}; Z) > c$, $\epsilon(c)$ becomes negligible. With $\delta_{var} > 0$ and $\lambda_{\mathcal{H}} > 0$, it follows that
$$R^{OOD}(h_{inv}) < R^{OOD}(h). \quad (45)$$
This ends the proof.

H. Implementation of CLIP-ICM

H.1. Interventional Data Collection

In Section 6, we propose two approaches for collecting interventional data. In the following, we provide a detailed explanation of each method.

Image-based Interventional Data Collection. To generate interventional data, we employ the following data augmentation techniques: Color Jitter, Gray Scale, Gaussian Blur, Random Invert, Random Rotation, Random Posterize, Random Solarize, and Random Equalize. The changes these methods induce in the latent factors of the corresponding images are summarized in Table 4. In practice, for each image we randomly select one augmentation technique from the set and generate an augmented sample, which is then paired with the original sample.

Table 4. Impact of data augmentation on specific latent factors ($Z_{inv}$ vs. $Z_{var}$). Latent factors (columns): Shape, Structure, Color, Texture. Augmentation techniques (rows): Color Jitter, Gray Scale, Gaussian Blur, Random Invert, Random Posterize, Random Solarize, Random Equalize, Random Rotation.

Text-based Interventional Data Collection. To collect text-based interventional data, two components are required: an image description model and a text intervention model. We utilize GPT-4o (OpenAI, 2023) for both purposes. The image description model takes an image as input, processes it using a carefully crafted prompt, and outputs a descriptive caption of the image. The text intervention model then takes this caption as input, applies a specified intervention using a modified prompt, and outputs an altered version of the original description.
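Before giving the exact prompts, the following is a minimal sketch of how one interventional pair per sample can be assembled from the two collection routes described above. The helpers describe_image and intervene_caption are hypothetical stand-ins for the two GPT-4o calls (whose prompts follow), the augmentation list mirrors Table 4 using standard torchvision transforms, and all parameter values are illustrative rather than the ones used in our experiments.

```python
# Sketch only: assemble (original embedding, intervened embedding) pairs for
# one sample, following the image-based and text-based collection described above.
import random
import torch
from torchvision import transforms

augmentations = [
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.Grayscale(num_output_channels=3),
    transforms.GaussianBlur(kernel_size=9),
    transforms.RandomInvert(p=1.0),
    transforms.RandomPosterize(bits=4, p=1.0),
    transforms.RandomSolarize(threshold=128, p=1.0),
    transforms.RandomEqualize(p=1.0),
    transforms.RandomRotation(degrees=30),
]

@torch.no_grad()
def interventional_pairs(image_pil, clip_model, preprocess, tokenizer,
                         describe_image=None, intervene_caption=None):
    """Return (original, intervened) embedding pairs for one training sample."""
    alpha = random.choice(augmentations)          # image-based intervention alpha(x)
    z_img = clip_model.encode_image(preprocess(image_pil).unsqueeze(0))
    z_img_do = clip_model.encode_image(preprocess(alpha(image_pil)).unsqueeze(0))
    pairs = [(z_img, z_img_do)]                   # image pair
    if describe_image is not None and intervene_caption is not None:
        caption = describe_image(image_pil)        # hypothetical GPT-4o description call
        caption_do = intervene_caption(caption)    # hypothetical GPT-4o intervention beta(t)
        z_txt = clip_model.encode_text(tokenizer(caption))
        z_txt_do = clip_model.encode_text(tokenizer(caption_do))
        pairs += [(z_txt, z_txt_do),               # text pair
                  (z_img, z_txt_do),               # image-text pair
                  (z_txt, z_img_do)]               # text-image pair
    return pairs
```

Stacking the first elements of these pairs across samples yields the matrix used as the original embeddings, and the second elements yield the intervened embeddings used in Appendix H.2 and Algorithm 1.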
Prompt for image description: "Describe this image in detail."

Prompt for text intervention:
"You are tasked with creating interventional descriptions of images for a classification task. Given the following JSON input:
{ "image_description": "A detailed textual description of the image, including all features (relevant and irrelevant to the classification task).", "class_labels": ["Class1", "Class2", "Class3"] }
Your goal is to:
1. Identify features related to the classification task that are invariant to environmental shifts. These features should remain unchanged in the output description.
2. Intervene on all other features (e.g., texture, background, lighting) by replacing their description with new, plausible alternatives.
3. Maintain the original structure of the description while altering only the specified features.
Output a single JSON object containing only the "intervened_description", which reflects these modifications."

We provide examples of both descriptions for samples from the PACS dataset in Figure 5.

Example 1, original description: "This image depicts a white dog with black or dark gray spots. The dog is shown in profile, facing to the right, with a focus on its head and neck. The background is painted in a neutral, earthy tone, possibly brown, which contrasts gently with the light coloration of the dog's fur. The texture of the painting suggests a traditional medium such as oil or acrylic on canvas, with visible brushstrokes that add depth and dimension to the fur and background. The lighting accentuates the contours of the dog's head, highlighting its structure and features, such as the ears, muzzle, and eyes, in a subtle and naturalistic manner."

Example 1, intervened description: "This image depicts a white dog with black or dark gray spots. The dog is shown in profile, facing to the right, with a focus on its head and neck. The background is painted in a vibrant, sky-blue tone, which contrasts gently with the light coloration of the dog's fur. The texture of the painting suggests a modern medium such as digital art, with smooth transitions that add depth and dimension to the fur and background. The lighting accentuates the contours of the dog's head, highlighting its structure and features, such as the ears, muzzle, and eyes, in a bright and slightly stylized manner."

Example 2, original description: "This image is a painting of a dog, focusing closely on the dog's face and part of its neck. The artwork employs a vibrant and stylized color palette, with prominent shades of blue, teal, and purple blending together to create an almost surreal effect. The dog's left ear (on the right side of the image) is painted in warm tones of brown and red, contrasting with the cooler blue hues on the rest of the face. The background is minimal and light-colored, which helps emphasize the vivid colors of the dog. The use of bold and expressive brushstrokes contributes to the artistic and abstract quality of the piece, highlighting the painter's creative interpretation of the subject."

Example 2, intervened description: "This image is a painting of a dog, focusing closely on the dog's face and part of its neck. The artwork employs a rich and imaginative color palette, with prominent shades of green, gold, and pink blending together to create an almost ethereal effect. The dog's left ear (on the right side of the image) is painted in cool tones of gray and silver, contrasting with the warmer golden hues on the rest of the face. The background is dark and textured, which helps emphasize the vibrant colors of the dog.
The use of smooth and dynamic brushstrokes contributes to the artistic and modern quality of the piece, showcasing the painter's innovative interpretation of the subject."

Figure 5. Example images with their image descriptions and the corresponding interventional descriptions from the PACS dataset.

H.2. Details in Estimating $A_{inv}$

The process begins with collecting $\hat{Z}$ and $\hat{Z}^{do(z_{inv})}$. For any given image sample $x$, we first generate an augmented version $\alpha(x)$ using data augmentation. We then obtain the corresponding text description $t$ using the text description model and create an intervened description $\beta(t)$ using the text intervention model. This allows us to construct four embedding pairs: $(f_T(t),\, f_T(t) - f_T(\beta(t)))$, $(f_T(t),\, f_T(t) - f_I(\alpha(x)))$, $(f_I(x),\, f_I(x) - f_I(\alpha(x)))$, and $(f_I(x),\, f_I(x) - f_T(\beta(t)))$. These pairs are stacked across all samples to form $\hat{Z}$ and $\hat{Z} - \hat{Z}^{do(z_{inv})}$. Once the data is collected, the next step is to fit $A_{inv}$. One approach is to use principal component analysis (PCA) to compute an $A_{inv}$ that satisfies Equation (9). Another approach is to treat $A_{inv}$ as a learnable matrix and optimize it directly using gradient descent.

H.3. Pseudo Code

The pseudo code of CLIP-ICM is given in Algorithm 1.

Algorithm 1 CLIP-ICM
Require: CLIP image encoder $f_I$, text encoder $f_T$, and images from the training domains $x \in \mathcal{D}$, $\mathcal{D} = \{\mathcal{D}^e\}_{e \in \mathcal{E}_{tr}}$; $K$ types of data augmentation $\mathcal{A} = \{\alpha_k\}_{k=1}^K$; hyperparameters $\lambda$ and $D_{inv}$; text descriptions of all images $x \in \mathcal{D}$; the text intervention model $\beta$.
1: Initialize the container $\hat{Z} = [\,]$ for the original representations
2: Initialize the container $\hat{Z}' = [\,]$ for the intervened representations
3: for $x$ in $\mathcal{D}$ do
4:   Sample a random augmentation $\alpha_k$ from $\mathcal{A}$
5:   Get the text description $t$ of $x$
6:   Get the original image representation $\hat{z} = f_I(x)$
7:   Get the original text representation $\hat{z}_t = f_T(t)$
8:   Get $x^{do(z_{inv})} = \alpha_k(x)$
9:   Get $t^{do(z_{inv})} = \beta(t)$
10:  Get the intervened representation $\hat{z}' = f_I(\alpha_k(x))$
11:  Get the intervened representation $\hat{z}'_t = f_T(\beta(t))$
12:  Append $\hat{z}$ into $\hat{Z}$  // image pair
13:  Append $\hat{z}'$ into $\hat{Z}'$
14:  Append $\hat{z}_t$ into $\hat{Z}$  // text pair
15:  Append $\hat{z}'_t$ into $\hat{Z}'$
16:  Append $\hat{z}$ into $\hat{Z}$  // image-text pair
17:  Append $\hat{z}'_t$ into $\hat{Z}'$
18:  Append $\hat{z}_t$ into $\hat{Z}$  // text-image pair
19:  Append $\hat{z}'$ into $\hat{Z}'$
20: end for
21: Compute the correlation matrix $Corr = \hat{Z}^\top \hat{Z} - \lambda (\hat{Z} - \hat{Z}')^\top (\hat{Z} - \hat{Z}')$, with $\hat{Z} \in \mathbb{R}^{N \times D}$, $\hat{Z}' \in \mathbb{R}^{N \times D}$, $Corr \in \mathbb{R}^{D \times D}$
22: Perform an eigenvalue decomposition of $Corr$ to obtain $U$, $V$, $\Sigma$
23: Take the $D_{inv}$ leading singular vectors, $U_{:D_{inv},:} \in \mathbb{R}^{D_{inv} \times D}$
Output: The target transform $A_{inv}$ is $U_{:D_{inv},:}$

H.4. Computation Cost and Data Collection Cost

All experiments are conducted on a single NVIDIA RTX A6000 GPU, and the average training duration for each result is less than one hour.

Computation Complexity Analysis. We provide a detailed analysis of the computational complexity of CLIP-ICM. According to Algorithm 1, the training process of CLIP-ICM can be divided into four key steps: data preprocessing, obtaining the representations of the training data, calculating $Corr$, and obtaining the projection matrix through SVD. Let $N$ denote the size of the training dataset and $D$ the dimension of the representation:
1. Data Preprocessing: Since data augmentation is applied to individual samples, the computational complexity of this step is $O(N)$.
2. Obtaining Representations: The pre-trained network is applied to individual samples. Like the previous step, this operation also has a complexity of $O(N)$.
3. Calculating $Corr$: Let $Z \in \mathbb{R}^{N \times D}$ be the matrix containing all training data representations.
The complexity of computing $Z^\top Z$ is $O(D^2 N)$. Since $D$ is the dimension of the representation and is independent of the dataset size, it can be treated as a constant, so the complexity of this step is also $O(N)$.
4. SVD of the Correlation Matrix: The computational complexity of the SVD of the correlation matrix is $O(D^3)$. However, since the matrix $Corr \in \mathbb{R}^{D \times D}$ is independent of the dataset size, the complexity of this step can be considered $O(1)$.

In summary, CLIP-ICM has a computational complexity of $O(N)$, growing linearly with the dataset size. For larger datasets, $A_{inv}$ is treated as a learnable matrix and optimized via gradient descent, requiring only a linear matrix in $\mathbb{R}^{D \times D_{inv}}$. In both approaches, CLIP-ICM remains highly efficient.

Data Collection Cost. We use GPT-4o as the generator for text-based interventional data, which involves three steps: processing image inputs, generating textual descriptions, and producing intervention-modified descriptions. For smaller datasets such as PACS, VLCS, Terra Incognita, and Office-Home, we generate interventional text data for all image samples. For large datasets such as DomainNet and ImageNet, we sample 20% of the dataset for text-based interventions. Data is collected using the GPT-4o batch API at a cost of $1.25 per million input tokens and $5 per million output tokens. The total cost for data collection is $162.79.

I. OOD Performance in Terra Incognita

We evaluate CLIP-ICM using the same experimental setup as described in Section 4, with results presented in Table 5. The results show that CLIP-ICM Linear-Probe improves performance by an average of 6.3% compared to the direct Linear-Probe. Furthermore, CLIP-ICM significantly outperforms naive fine-tuning on new classes under both the open-class and domain-shift conditions.

Table 5. Performance of CLIP on the Terra Incognita dataset in accuracy (%). L100, L38, L43, and L46 represent different environments. In the Linear-Probe row, the values in parentheses are the accuracy achieved by directly fine-tuning with a linear probe on the target domain; in the CLIP-ICM Linear-Probe row, they are the improvement over the direct Linear-Probe.

Domain shift
Method | L100 | L38 | L43 | L46 | Avg
Linear-Probe | 73.6 (90.9) | 58.3 (86.3) | 61.0 (76.0) | 47.8 (78.9) | 60.2 (83.0)
CLIP-ICM Linear-Probe | 79.8 (+6.2) | 61.5 (+3.2) | 66.9 (+5.9) | 57.9 (+10.1) | 66.5 (+6.3)

Open class under domain shift
Split | Method | L100 | L38 | L43 | L46 | Avg
Base | Zero-Shot | 56.4 | 23.2 | 30.4 | 25.8 | 33.9
Base | Fine-Tune | 75.6 | 58.5 | 65.2 | 55.1 | 63.6
Base | CLIP-ICM | 63.7 | 52.6 | 58.8 | 51.2 | 56.6
New | Zero-Shot | 47.6 | 17.6 | 35.2 | 37.4 | 34.5
New | Fine-Tune | 36.7 | 11.2 | 25.1 | 25.5 | 24.6
New | CLIP-ICM | 59.4 | 43.2 | 51.6 | 46.8 | 50.2
Total | Zero-Shot | 52.0 | 20.4 | 32.8 | 31.6 | 34.2
Total | Fine-Tune | 56.2 | 34.9 | 45.2 | 40.3 | 44.2
Total | CLIP-ICM | 61.6 | 47.9 | 55.2 | 49.0 | 53.4

J. More Results

We also utilize ImageNet as the training data and validate the OOD generalization ability of CLIP-ICM on four datasets: ImageNet-V2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019), and ImageNet-R (Hendrycks et al., 2021a). Specifically:
ImageNet-V2 (Recht et al., 2019) contains new test data for the ImageNet benchmark, providing a fresh evaluation of model performance.
ImageNet-A (Hendrycks et al., 2021b) consists of real-world, unmodified, and naturally occurring examples that are often misclassified by ResNet models, presenting challenging cases for OOD generalization.
ImageNet-Sketch (Wang et al., 2019) includes sketch representations of ImageNet classes, offering a test of the model's ability to generalize to different visual styles.
ImageNet-R (Hendrycks et al., 2021a) contains art, cartoon, deviantart, graffiti, embroidery, graphic, origami, painting, pattern, plastic-object, plush-object, sculpture, sketch, tattoo, toy, and video-game renditions of ImageNet classes.

We additionally conduct experiments on the iWildCam-WILDS 2020 dataset (Koh et al., 2021). iWildCam comprises 203,029 images of 182 different animal species, collected from 323 camera traps distributed across various locations. The images obtained from different locations exhibit variations in lighting, color, camera angle, background, vegetation, and relative animal frequencies. We follow the setting in (Koh et al., 2021), using images from 243 locations as the training domain and those from 48 other locations as the test domain. We report the average macro F1 score of CLIP and the three CLIP-ICM variants under both ID and OOD conditions in Table 7. From the results in Table 7, we observe that regardless of the type of interventional data, CLIP-ICM consistently outperforms CLIP in both the ID and OOD settings.

Table 6. Accuracy on ImageNet with various domain shifts. The three CLIP-ICM rows (and the three CLIP-ICM Linear-Probe rows in Table 7) correspond to the three interventional-data variants.
Method | ImageNet (ID) | ImageNet-V2 | ImageNet-S | ImageNet-A | ImageNet-R | OOD Avg.
Zero-Shot | 66.7 | 60.8 | 46.1 | 47.8 | 74.0 | 57.2
Fine-Tune | 68.2±0.1 | 61.9±0.1 | 46.8±0.1 | 46.4±0.1 | 75.1±0.1 | 57.6±0.1
CoOp (Zhou et al., 2022b) | 71.5 | 64.2 | 48.0 | 49.7 | 75.2 | 59.3
CoCoOp (Zhou et al., 2022a) | 71.0 | 64.2 | 48.8 | 50.6 | 76.2 | 59.9
CLIPood (Shu et al., 2023) | 71.6±0.1 | 64.9±0.1 | 49.3±0.1 | 50.4±0.1 | 77.2±0.1 | 60.4±0.1
CLIP-ICM | 70.5±0.3 | 64.7±0.5 | 49.9±0.4 | 50.3±0.3 | 76.2±0.2 | 60.2±0.4
CLIP-ICM | 71.2±0.4 | 63.7±0.3 | 48.8±0.4 | 46.8±0.1 | 74.8±0.2 | 58.5±0.3
CLIP-ICM | 71.6±0.1 | 65.8±0.3 | 50.9±0.4 | 51.4±0.3 | 77.6±0.2 | 61.4±0.3

Table 7. Average macro F1 score on the iWildCam-WILDS 2020 dataset.
Method | ID | OOD
CLIP | 14.2 | 10.6
CLIP-ICM | 15.6 | 13.3
CLIP-ICM | 15.2 | 12.2
CLIP-ICM | 15.8 | 14.1
CLIP Linear-Probe | 54.6 | 41.4
CLIP-ICM Linear-Probe | 56.2 | 42.2
CLIP-ICM Linear-Probe | 55.6 | 44.3
CLIP-ICM Linear-Probe | 57.1 | 46.1

K. Dataset Details from the DomainBed Benchmark

DomainBed includes downloaders and loaders for seven multi-domain image classification tasks; we use the following:
PACS (Li et al., 2017) comprises four domains $e \in$ {art, cartoons, photos, sketches}. This dataset contains 9,991 examples of dimension (3, 224, 224) and 7 classes.
VLCS (Fang et al., 2013) comprises photographic domains $e \in$ {Caltech101, LabelMe, SUN09, VOC2007}. This dataset contains 10,729 examples of dimension (3, 224, 224) and 5 classes.
Office-Home (Venkateswara et al., 2017) includes domains $e \in$ {art, clipart, product, real}. This dataset contains 15,588 examples of dimension (3, 224, 224) and 65 classes.
Terra Incognita (Beery et al., 2018) contains photographs of wild animals taken by camera traps at locations $e \in$ {L100, L38, L43, L46}. Our version of this dataset contains 24,788 examples of dimension (3, 224, 224) and 10 classes.
DomainNet (Peng et al., 2019) has six domains $e \in$ {clipart, infograph, painting, quickdraw, real, sketch}. This dataset contains 586,575 examples of size (3, 224, 224) and 345 classes.
For all datasets, we first pool the raw training, validation, and testing images together. For each random seed, we then instantiate random training, validation, and testing splits.

L. More comparison with related works
L.1. Invariant Risk Minimization (Arjovsky et al., 2020)

Invariant Risk Minimization (IRM) and IRM-based methods (Ahuja et al., 2021; Yang et al., 2023) also aim to minimize OOD risk with an invariant predictor. However, note that these methods obtain invariant predictors without identifying the invariant factors. Moreover, as suggested in Figure 2 of Arjovsky et al. (2020), there are multiple solutions to the IRM objective; therefore, finding an IRM solution does not imply the identification of $Z_{inv}$. Compared to these methods, our method ensures the identification of $Z_{inv}$.

L.2. Model-based Domain Generalization (Robey et al., 2021)

Model-based Domain Generalization (MBDG) (Robey et al., 2021) also proposes an SCM, but its analysis differs from ours. MBDG assumes that, under domain shift, an instance $X^e$ is generated by an underlying random variable $X$ and $e$ jointly passing through a domain-transfer model $G(X, e)$. In our causal diagram analysis, the sample is generated jointly by $Z_{var}$ and $Z_{inv}$. MBDG obtains invariant features by utilizing a pre-trained domain-transfer model to constrain the distance between the features extracted by the feature extractor and the representations generated by the domain-transfer model. In contrast, our approach learns a linear transformation of the pre-trained representation to obtain features that are invariant to data augmentation. We provide comparison results between our method and MBDG in Table 15 and Table 12.

L.3. Subspace Identification Guarantee (Li et al., 2024)

From the perspective of SCM analysis, SIG views the data-generation process through four kinds of latent factors. In our SCM, $Z_{inv}$ can be regarded as a domain-invariant variable and $Z_{var}$ as a domain-variant variable; however, we do not distinguish between label-irrelevant and label-relevant variables in our modeling. From a methodological perspective, SIG uses an end-to-end, reconstruction-based network, while we use a pre-trained CLIP backbone. SIG uses a variational-inference method to identify latent factors, while we use an interventional method. We present the performance of our method alongside SIG on the Office-Home dataset in Table 13.

L.4. Domain Prompt Learning (Zhang et al., 2023b)

(Zhang et al., 2023b) introduce Domain Prompt Learning (DPL), a novel approach that improves the domain-generalization performance of CLIP on various datasets by generating conditional prompts, achieving higher accuracy without fine-tuning the entire model. DPL can be regarded as a test-time adaptation (TTA) approach, while our method is a domain-generalization approach. Nevertheless, we provide comparison results in Table 2.

L.5. Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style (Von Kügelgen et al., 2021)

(Von Kügelgen et al., 2021) also interpret data augmentations as counterfactuals and use them to obtain invariant factors. Our work differs from theirs in three key aspects: the background setting, the research problem, and the implementation method. First, the problem studied by (Von Kügelgen et al., 2021) is set within the framework of self-supervised contrastive learning. In their paper, the term "invariant factor" refers to the invariant parts derived from two different augmented views within a positive sample pair.
In contrast, the "invariant factor" in our paper refers to the latent factors that remain unchanged across different domains. Due to the diversity of augmentation methods employed in self-supervision (Chen et al., 2020), the invariant factors they investigate do not align with the invariant factors we examine.

Second, our theoretical approach diverges fundamentally from (Von Kügelgen et al., 2021) in both methodology and objectives. While their work establishes that non-linear encoders can achieve block identification of content factors through embedding-space alignment of view-augmented samples, our analysis proceeds through two critical theoretical advancements: (1) we first demonstrate that CLIP learns representations that identify all latent factors up to an invertible linear transformation; (2) we subsequently prove that interventional data enables estimation of a linear mapping from the embedding space to the invariant subspace.

Finally, the research objectives of the two works are fundamentally distinct. While (Von Kügelgen et al., 2021) focuses on analyzing SimCLR's factor-identification capabilities, our work specifically addresses CLIP's out-of-distribution generalization limitations by developing linear mappings to invariant subspaces. This difference in both theoretical focus and practical application underscores the novel contributions of our approach.

Figure 6. The experimental results for the ablation studies. (a) The zero-shot accuracy of CLIP-ICM on the LabelMe domain of the VLCS dataset with different combinations of augmentations (Color Jitter, Gray Scale, Gaussian Blur, Random Invert, Random Posterize, Random Solarize, Random Equalize); brighter colors indicate higher accuracy. (b) The linear-probe accuracy of CLIP-ICM on the VLCS dataset with different values of λ (from 1e1 to 1e6). (c) The accuracy of CLIP-ICM on VLCS with different choices of $D_{inv}$ (100 to 500). (d) The accuracy of CLIP-ICM on the LabelMe domain of the VLCS dataset with different numbers of interventional data pairs per sample.

M. Ablation Study

M.1. The impact of different data augmentations

We investigate how CLIP-ICM performs with varying data augmentation strategies on the VLCS dataset. In this experiment, only image-based interventional data are utilized. The experiments include different combinations of augmentation techniques, and the prediction accuracy on the LabelMe domain is illustrated in Figure 6(a). The diagonal of Figure 6(a) indicates scenarios where only one augmentation is applied, while the other cells represent combinations of two data augmentation techniques. It is noteworthy that the use of data augmentation significantly improves performance, with certain combinations showing even more promising results.

M.2. The impact of the hyperparameter λ

We conduct experiments on the VLCS dataset and visualize the results in Figure 6(b). In this experiment, only image-based interventional data are utilized.
From the results, we observe that λ has a significant impact on the prediction results. This is because a higher value of λ means the second term of Equation (9) is weighted more heavily. These results illustrate the effectiveness of our proposed method.

M.3. Ablation study on the dimension of $Z_{inv}$

As mentioned in Section 6, the dimension $D_{inv}$ of $Z_{inv}$ is a hyperparameter. To investigate the influence of $D_{inv}$, we conduct an ablation study regarding its choice. All results are the performance of CLIP-ICM with only image-based interventional data, conducted on the VLCS dataset with ViT-B/16 as the backbone, where $D_{inv} < 512$. The results are shown in Figure 6(c). From Figure 6(c) we can see that as $D_{inv}$ increases, the accuracy first increases, then stabilizes, slowly decreases, and finally drops. Therefore, the optimal value of $D_{inv}$ is around 300 to 350 for all domains in VLCS.

M.4. Number of interventional data pairs

By default, each sample from the training environments is paired with one image-based interventional sample via data augmentation and one text-based interventional sample generated by an LLM. In this section, we examine whether increasing the amount of interventional data enhances CLIP-ICM's performance, as shown in Figure 6(d). The results indicate that while adding more data pairs slightly improves performance, the gain is marginal.

M.5. The role of $A_{inv}$

To validate the role of $A_{inv}$, we conduct an ablation study to illustrate its contribution. Specifically, we use our generated image-based interventional data to train a linear probe on the DomainBed benchmark. The experimental results are presented in Table 8. From the results, we observe that incorporating image-based interventional data into the linear probe leads to only marginal improvements. In contrast, although both settings utilize the same interventional data, the performance of the CLIP-ICM Linear-Probe is significantly higher. These findings demonstrate that our proposed $A_{inv}$ module substantially enhances the OOD generalization ability of CLIP.

Table 8. Ablation results on the DomainBed benchmark under domain shift. We compare the standard linear probe, a linear probe trained with image-based interventional data, and the CLIP-ICM linear probe trained with the proposed $A_{inv}$ module; the latter two use only image-based interventional data.
Method | PACS | VLCS | Office-Home | TerraInc | DomainNet | Avg.
Linear-Probe | 96.4 | 78.7 | 81.9 | 60.2 | 55.0 | 74.4
Linear-Probe + Interventional data | 96.8 | 79.3 | 82.3 | 60.5 | 55.8 | 74.9
CLIP-ICM Linear-Probe | 97.5 | 86.5 | 84.6 | 64.3 | 64.0 | 79.0

N. An analysis of the zero-shot prediction with domain prompt

In the experiments of Section 7, we use the standard text prompt template when performing zero-shot prediction, i.e., "A photo of [cls]", where [cls] is the name of the corresponding class. To explore the influence of text prompts on zero-shot prediction, we conducted experiments on the PACS, VLCS, and Office-Home datasets. In these experiments, we employed text prompts that incorporate domain information. For datasets where the domain names hold particular significance, such as PACS and Office-Home, we utilized the prompt "A [domain] of [cls]". In contrast, for the VLCS dataset, where the domain names do not have specific meanings, we adopted the prompt "A photo of [cls] in [domain]". Here, [domain] is the name of the test domain. The results are illustrated in Table 9, Table 10, and Table 11.
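For reference, the following is a minimal sketch of how such zero-shot predictions can be computed with a prompt template, optionally scoring in the invariant subspace. It assumes a CLIP-style model exposing encode_image/encode_text, a tokenizer, and an already-estimated projection matrix A_inv (as produced by Algorithm 1); it is an illustration rather than the exact evaluation code.

```python
# Sketch only: zero-shot class prediction with a prompt template; if A_inv is
# given, both modalities are projected into the invariant subspace first.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(clip_model, tokenizer, image_tensor, class_names,
                      template="A photo of {}", A_inv=None):
    prompts = tokenizer([template.format(c) for c in class_names])
    text_feat = clip_model.encode_text(prompts)        # (C, D)
    img_feat = clip_model.encode_image(image_tensor)   # (1, D)
    if A_inv is not None:                               # A_inv: (D_inv, D)
        text_feat = text_feat @ A_inv.t()               # (C, D_inv)
        img_feat = img_feat @ A_inv.t()                 # (1, D_inv)
    text_feat = F.normalize(text_feat, dim=-1)
    img_feat = F.normalize(img_feat, dim=-1)
    logits = img_feat @ text_feat.t()                   # cosine similarities
    return logits.argmax(dim=-1)                        # predicted class index

# Domain-aware templates explored in this section (standard template is the default):
#   template = "A {domain} of {{}}".format(domain="sketch")          # PACS / Office-Home style
#   template = "A photo of {{}} in {domain}".format(domain="SUN09")  # VLCS style
```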
It is important to note that the objective of domain generalization is to evaluate the performance of a model on unseen domains. Therefore, incorporating domain information, such as "a sketch of [cls]", does not align with the requirements of the domain generalization task; we conduct this experiment only for exploration. From the results in Table 9, we find that only in some domains does incorporating domain information in the prompt template improve zero-shot performance, for example Sketch in PACS and Clipart in Office-Home. However, in most domains, where the domain name does not provide any information or cannot describe the domain properly, the performance is lower than with the original template. Another important observation is that CLIP-ICM consistently improves the performance of CLIP in most domains, without any domain-related information or other kinds of template design.

Table 9. Comparison of the zero-shot performance of CLIP with the domain prompt on the PACS dataset. ↑ denotes that the result of CLIP with the domain prompt is higher than with the standard prompt, while ↓ denotes that it is lower.
Algorithm | Photo | Art | Cartoon | Sketch | Avg
CLIP | 100.0 | 97.4 | 99.2 | 88.1 | 96.1
CLIP + Domain Prompt | 100.0 | 97.4 | 99.0 | 90.2 | 96.6
CLIP-ICM | 99.1 | 99.5 | 99.7 | 92.2 | 97.7

Table 10. Comparison of the zero-shot performance of CLIP with the domain prompt on the VLCS dataset. ↑ denotes that the result of CLIP with the domain prompt is higher than with the standard prompt, while ↓ denotes that it is lower.
Algorithm | Caltech101 | LabelMe | SUN09 | VOC2007 | Avg
CLIP | 99.9 | 70.1 | 73.5 | 86.1 | 82.4
CLIP + Domain Prompt | 99.2 | 69.7 | 65.1 | 80.3 | 78.5
CLIP-ICM | 100.0 | 75.6 | 79.2 | 90.0 | 86.2

Table 11. Comparison of the zero-shot performance of CLIP with the domain prompt on the Office-Home dataset. ↑ denotes that the result of CLIP with the domain prompt is higher than with the standard prompt, while ↓ denotes that it is lower.
Algorithm | Art | Clipart | Product | Real | Avg
CLIP | 83.3 | 65.3 | 89.0 | 89.3 | 81.7
CLIP + Domain Prompt | 82.4 | 67.0 | 87.8 | 88.7 | 81.4
CLIP-ICM | 84.6 | 70.7 | 91.7 | 91.5 | 84.6

O. Full Results on DomainBed

In this section, we provide the full results on the datasets from the DomainBed benchmark.

Table 12. Accuracy on the PACS dataset. P, A, C, and S represent different domains. Algorithm P A C S Avg Backbone is Res Net-50 ERM 97.2 0.3 84.7 0.4 80.8 0.6 79.3 1.0 85.5 IRM (Arjovsky et al., 2020) 96.7 0.6 84.8 1.3 76.4 1.1 76.1 1.0 83.5 Group DRO (Sagawa et al.)
96.7 0.3 83.5 0.9 79.1 0.6 78.3 2.0 84.4 Mixup (Yan et al., 2020) 97.6 0.1 86.1 0.5 78.9 0.8 75.8 1.8 84.6 MMD (Li et al., 2018) 96.6 0.2 86.1 1.4 79.4 0.9 76.5 0.5 84.6 DANN (Ganin et al., 2016) 97.3 0.4 86.4 0.8 77.4 0.8 73.5 2.3 83.6 ARM (Zhang et al., 2021) 97.4 0.3 86.8 0.6 76.8 0.5 79.3 1.2 85.1 MBDG (Robey et al., 2021) 97.0 80.6 79.3 85.2 85.6 CLIP Linear-probe 99.2 0.7 91.7 0.8 92.5 0.7 82.3 0.2 91.4 CLIP Zero-shot 99.3 0.0 91.0 0.0 93.1 0.0 80.5 0.0 91.0 CLIP-ICM 99.4 0.4 93.6 0.2 99.3 0.5 83.1 0.6 93.8 CLIP-ICM Linear-probe 99.4 0.5 92.8 0.7 93.9 0.4 83.8 0.5 92.5 CLIP-ICM 99.0 0.2 93.3 0.4 98.9 0.1 82.8 0.4 93.8 CLIP-ICM Linear-probe 99.1 0.5 92.5 0.4 93.4 0.3 83.5 0.4 92.5 CLIP-ICM 99.8 0.4 94.2 0.2 99.7 0.4 83.7 0.5 93.8 CLIP-ICM Linear-probe 100.0 0.0 93.4 0.2 94.5 0.4 84.2 0.2 92.5 Backbone is Vi T-B/16 CLIP Zero-shot 100.0 0.0 97.4 0.0 99.2 0.0 88.1 0.0 96.1 CLIP Linear-probe 97.3 0.7 98.4 0.1 99.5 0.4 90.4 0.3 96.4 CLIP-ICM 98.7 0.5 99.1 0.4 99.9 0.1 91.7 0.4 97.3 CLIP-ICM Linear-probe 98.8 0.2 99.4 0.3 99.8 0.8 92.1 0.5 97.5 CLIP-ICM 98.3 0.4 98.6 0.4 99.5 0.3 90.7 0.3 96.8 CLIP-ICM Linear-probe 98.4 0.1 98.9 0.3 99.3 0.1 92.1 0.1 97.2 CLIP-ICM 99.1 0.4 99.5 0.2 99.7 0.3 92.2 0.4 97.7 CLIP-ICM Linear-probe 99.1 0.4 99.9 0.1 99.9 0.1 92.4 0.1 97.8 Learning Invariant Causal Mechanism from Vision-Language Models Table 13. Accuracy on Office-Home dataset. A,C,P and R represents different domains. Algorithm A C P R Avg Backbone is Res Net-50 ERM 61.3 0.7 52.4 0.3 75.8 0.1 76.6 0.3 66.5 IRM (Arjovsky et al., 2020) 58.9 2.3 52.2 1.6 72.1 2.9 74.0 2.5 64.3 Group DRO (Sagawa et al.) 60.4 0.7 52.7 1.0 75.0 0.7 76.0 0.7 66.0 Mixup (Yan et al., 2020) 62.4 0.8 54.8 0.6 76.9 0.3 78.3 0.2 68.1 MMD (Li et al., 2018) 60.4 0.2 53.3 0.3 74.3 0.1 77.4 0.6 66.3 DANN (Ganin et al., 2016) 59.9 1.3 53.0 0.3 73.6 0.7 76.9 0.5 65.9 ARM (Zhang et al., 2021) 58.9 0.8 51.0 0.5 74.1 0.1 75.2 0.3 64.8 SIG (Li et al., 2024) 76.4 63.9 85.4 85.8 77.8 CLIP Zero-shot 71.3 0.0 50.4 0.0 81.7 0.0 82.6 0.0 71.5 CLIP Linear-probe 68.0 0.8 46.3 0.3 80.4 0.9 81.9 0.7 69.1 CLIP-ICM 72.6 0.4 55.0 0.9 83.2 0.3 83.7 0.8 73.6 CLIP-ICM Linear-probe 78.3 0.3 56.4 0.8 88.6 0.8 87.7 0.7 77.8 CLIP-ICM 72.2 0.4 54.6 0.5 82.7 0.2 83.3 0.3 73.2 CLIP-ICM Linear-probe 77.9 0.1 55.9 0.4 88.2 0.5 87.3 0.2 77.3 CLIP-ICM 72.9 0.2 55.4 0.2 83.6 0.1 84.2 0.3 74.0 CLIP-ICM Linear-probe 78.8 0.4 56.7 0.3 89.1 0.4 88.1 0.2 78.2 Backbone is Vi T-B/16 CLIP Zero-shot 83.3 0.0 65.3 0.0 89.0 0.0 89.3 0.0 81.7 CLIP Linear-probe 78.9 0.9 69.3 0.9 90.3 0.3 89.0 0.2 81.9 CLIP-ICM 83.7 0.7 67.4 0.6 90.8 0.9 90.1 0.6 82.6 CLIP-ICM Linear-probe 84.3 0.5 71.4 0.2 92.5 0.2 90.2 0.8 84.6 CLIP-ICM 82.8 0.4 66.8 0.3 89.8 0.2 89.0 0.2 82.1 CLIP-ICM Linear-probe 83.3 0.4 70.5 0.4 91.4 0.4 84.4 0.2 82.4 CLIP-ICM 84.6 0.3 70.7 0.3 91.7 0.3 91.5 0.4 84.6 CLIP-ICM Linear-probe 84.3 0.5 71.4 0.2 92.5 0.2 90.2 0.8 87.1 Learning Invariant Causal Mechanism from Vision-Language Models Table 14. Accuracy on Terra Incognita dataset. Algorithm L100 L38 L43 L46 Avg Backbone is Res Net-50 ERM 49.8 4.4 42.1 1.4 56.9 1.8 35.7 3.9 46.1 IRM (Arjovsky et al., 2020) 54.6 1.3 39.8 1.9 56.2 1.8 39.6 0.8 47.6 Group DRO (Sagawa et al.) 
41.2 0.7 38.6 2.1 56.7 0.9 36.4 2.1 43.2 Mixup (Yan et al., 2020) 59.6 2.0 42.2 1.4 55.9 0.8 33.9 1.4 47.9 MMD (Li et al., 2018) 41.9 3.0 34.8 1.0 57.0 1.9 35.2 1.8 42.2 DANN (Ganin et al., 2016) 51.1 3.5 40.6 0.6 57.4 0.5 37.7 1.8 46.7 ARM (Zhang et al., 2021) 49.3 0.7 38.3 2.4 55.8 0.8 38.7 1.3 45.5 CLIP Zero-shot 7.7 0.0 14.8 0.0 32.4 0.0 20.9 0.0 19.0 CLIP Linear-probe 52.0 1.0 34.4 0.6 56.1 0.9 32.8 0.2 44.1 CLIP-ICM 38.8 0.4 33.6 0.6 33.6 0.1 30.1 0.7 34.0 CLIP-ICM Linear-probe 64.1 0.4 46.5 0.7 61.1 0.6 42.8 0.4 53.6 CLIP-ICM 38.3 0.1 32.9 0.4 33.0 0.2 29.4 0.4 33.4 CLIP-ICM Linear-probe 63.5 0.2 45.9 0.5 60.3 0.2 42.0 0.2 52.9 CLIP-ICM 39.6 0.3 34.5 0.5 34.8 0.4 31.2 0.3 35.0 CLIP-ICM Linear-probe 65.3 0.5 47.0 0.2 62.0 0.3 43.4 0.3 54.4 Backbone is Vi T-B/16 CLIPOOD (Shu et al., 2023) 74.6 0.3 57.3 0.5 59.1 0.9 47.7 0.9 59.7 CLIP Zero-shot 52.0 0.0 20.4 0.0 32.8 0.0 31.6 0.0 34.2 CLIP Linear-probe 73.6 0.5 58.3 0.7 61.0 0.1 47.8 1.0 60.2 CLIP-ICM 59.3 0.4 46.2 0.4 52.4 0.7 41.6 0.8 49.9 CLIP-ICM Linear-probe 78.7 0.5 60.2 0.9 66.0 0.8 52.4 0.1 64.3 CLIP-ICM 58.5 0.2 45.2 0.2 51.6 0.2 37.4 0.2 48.2 CLIP-ICM Linear-probe 77.9 0.3 59.1 0.2 65.4 0.1 50.5 0.2 63.2 CLIP-ICM 60.8 0.1 47.3 0.4 53.5 0.1 48.4 0.1 52.5 CLIP-ICM Linear-probe 79.8 0.3 61.5 0.3 66.9 0.3 57.9 0.3 66.5 Learning Invariant Causal Mechanism from Vision-Language Models Table 15. Accuracy on VLCS dataset. Algorithm C L S V Avg Backbone is Res Net-50 ERM 97.7 0.4 64.3 0.9 73.4 0.5 74.6 1.3 77.5 IRM (Arjovsky et al., 2020) 98.6 0.1 64.9 0.9 73.4 0.6 77.3 0.9 78.5 Group DRO (Sagawa et al.) 97.3 0.3 63.4 0.9 69.5 0.8 76.7 0.7 76.7 Mixup (Yan et al., 2020) 98.3 0.6 64.8 1.0 72.1 0.5 74.3 0.8 77.4 MMD (Li et al., 2018) 97.7 0.1 64.0 1.1 72.8 0.2 75.3 3.3 77.5 DANN (Ganin et al., 2016) 99.0 0.3 65.1 1.4 73.1 0.3 77.2 0.6 78.6 ARM (Zhang et al., 2021) 98.7 0.2 63.6 0.7 71.3 1.2 76.7 0.6 77.6 MBDG (Robey et al., 2021) 98.3 68.1 68.8 76.3 77.9 CLIP Zero-shot 99.2 0.0 69.5 0.0 69.8 0.0 84.9 0.0 80.9 CLIP Linear-probe 98.1 0.7 63.8 0.8 79.8 0.7 84.5 1.0 81.5 CLIP-ICM 99.4 0.4 71.8 0.8 71.7 0.8 85.3 1.0 82.1 CLIP-ICM Linear-probe 99.2 0.1 69.9 0.3 80.5 0.3 89.5 0.9 84.8 CLIP-ICM 98.7 0.5 71.1 0.2 71.0 0.4 84.4 0.4 81.3 CLIP-ICM Linear-probe 98.4 0.3 69.3 0.2 79.8 0.2 88.6 0.2 84.0 CLIP-ICM 100.0 0.0 72.4 0.2 72.5 0.4 86.0 0.2 82.7 CLIP-ICM Linear-probe 99.7 0.3 70.7 0.2 81.0 0.4 90.0 0.3 85.4 Backbone is Vi T-B/16 CLIPOOD (Shu et al., 2023) 97.5 0.6 68.3 0.5 83.9 0.9 88.7 1.0 84.6 CLIP Zero-shot 99.9 0.0 70.1 0.0 73.5 0.0 86.1 0.0 82.4 CLIP Linear-probe 93.7 0.2 65.4 0.2 76.4 0.2 79.1 0.3 78.7 CLIP-ICM 100.0 0.0 74.7 0.4 74.5 0.4 87.1 0.7 84.1 CLIP-ICM Linear-probe 98.8 0.3 72.7 0.8 86.4 0.9 87.9 0.9 86.5 CLIP-ICM 98.8 0.2 73.6 0.2 73.9 0.5 87.3 0.4 83.4 CLIP-ICM Linear-probe 98.2 0.4 71.7 0.4 85.8 0.1 85.1 0.2 85.2 CLIP-ICM 100.0 0.0 75.6 0.2 79.2 0.4 90.0 0.1 86.2 CLIP-ICM Linear-probe 99.8 0.3 73.9 0.3 87.3 0.5 85.5 0.3 86.6 Learning Invariant Causal Mechanism from Vision-Language Models Table 16. Accuracy on Domain Net dataset. Algorithm CLIPART INFOGRAPH PAINTING QUICKDRAW REAL SKETCH Avg Backbone is Res Net-50 ERM 58.1 0.3 18.8 0.3 46.7 0.3 12.2 0.4 59.6 0.1 49.8 0.4 40.9 IRM (Arjovsky et al., 2020) 48.5 2.8 15.0 1.5 38.3 4.3 10.9 0.5 48.2 5.2 42.3 3.1 33.9 Group DRO (Sagawa et al.) 
47.2 0.5 17.5 0.4 33.8 0.5 9.3 0.3 51.6 0.4 40.1 0.6 33.3 Mixup (Yan et al., 2020) 55.7 0.3 18.5 0.5 44.3 0.5 12.5 0.4 55.8 0.3 48.2 0.5 39.2 MMD (Li et al., 2018) 32.1 13.3 11.0 4.6 26.8 11.3 8.7 2.1 32.7 13.8 28.9 11.9 23.4 DANN (Ganin et al., 2016) 53.1 0.2 18.3 0.1 44.2 0.7 11.8 0.1 55.5 0.4 46.8 0.6 38.3 ARM (Zhang et al., 2021) 49.7 0.3 16.3 0.5 40.9 1.1 9.4 0.1 53.4 0.4 43.5 0.4 35.5 CLIP Zero-shot 53.1 0.0 39.2 0.0 52.9 0.0 5.7 0.0 76.7 0.0 48.0 0.0 45.9 CLIP Linear-probe 59.2 0.1 38.6 0.9 54.1 0.7 8.2 0.7 56.4 0.7 39.6 0.7 42.6 CLIP-ICM 58.0 0.1 42.3 0.5 54.4 0.2 12.8 0.7 78.2 0.1 49.6 0.2 49.2 CLIP-ICM Linear-probe 63.1 0.1 50.1 0.2 66.2 1.0 19.7 0.3 83.8 0.4 63.2 0.3 57.8 CLIP-ICM 57.4 0.2 41.7 0.3 53.9 0.1 12.1 0.5 77.6 0.3 49.0 0.3 48.6 CLIP-ICM Linear-probe 62.4 0.2 49.4 0.2 65.7 0.1 18.9 0.4 83.2 0.5 62.6 0.2 57.0 CLIP-ICM 59.1 0.3 43.4 0.2 55.5 0.3 13.8 0.2 79.1 0.3 50.8 0.3 50.3 CLIP-ICM Linear-probe 64.1 0.2 51.3 0.5 67.2 0.5 20.6 0.5 84.9 0.3 64.1 0.2 58.7 Backbone is Vi T-B/16 CLIPOOD (Shu et al., 2023) 77.6 54.7 72.5 20.7 85.7 69.9 63.5 CLIP Zero-shot 70.2 0.0 46.6 0.0 65.0 0.0 13.7 0.0 82.9 0.0 62.7 0.0 56.8 CLIP Linear-probe 73.9 0.5 40.2 0.5 62.1 0.2 15.1 0.3 76.9 0.1 62.0 0.7 55.0 CLIP-ICM 77.1 0.5 51.8 0.5 67.9 1.0 16.7 0.3 82.9 0.8 66.9 0.2 60.5 CLIP-ICM Linear-probe 78.6 0.2 55.6 0.2 72.9 0.6 22.1 0.7 86.1 0.3 68.5 0.3 64.0 CLIP-ICM 75.9 0.5 50.1 0.3 66.2 0.4 15.0 0.5 81.4 0.1 55.8 0.3 57.4 CLIP-ICM Linear-probe 76.7 0.3 53.5 0.2 70.6 0.4 20.8 0.4 84.0 0.2 52.1 0.5 59.6 CLIP-ICM 78.1 0.4 52.8 0.3 69.0 0.2 17.6 0.4 84.2 0.2 64.8 0.2 61.1 CLIP-ICM Linear-probe 79.6 0.2 56.7 0.2 74.1 0.3 23.5 0.2 87.3 0.2 68.9 0.3 65.0