# PROBABILISTIC LANGUAGE-IMAGE PRE-TRAINING

Published as a conference paper at ICLR 2025

Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun (NAVER AI Lab)

Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty via an uncertainty token without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip.

1 INTRODUCTION

Vision-language models (VLMs) aim for a joint vision-language embedding space and have become a cornerstone in the recent advance of machine learning (Radford et al., 2021; Jia et al., 2021; Li et al., 2022; Zhai et al., 2023). For training, VLMs map an aligned image-text pair (e.g., an image and its corresponding caption) into the same space using contrastive learning. Their rich joint representations learned from large-scale image-text aligned datasets have achieved significant success in various downstream tasks, such as zero-shot classification (by treating class labels as templated texts, e.g., "a photo of { }") or image-text cross-modal retrieval.

Despite their great success, most VLMs encode representations into a deterministic Euclidean space. This assumes a one-to-one correspondence between images and texts, which oversimplifies the complex nature of real-world relationships. In practice, image-text matching is inherently many-to-many. Multiple captions can accurately describe an image, each highlighting different aspects of the visual content. For example, a train image can be described by multiple captions, e.g., "a train", "train station", or "train parked next to a station". Conversely, a caption may correspond to several images describing similar scenes or objects, e.g., "a train" can be matched to all the train images. However, as shown in Figure 1 (b), a deterministic model (e.g., CLIP (Radford et al., 2021)) fails to capture this multiplicity, e.g., the "Train Station" embedding is located at a point unrelated to the other train images and captions. This is because the CLIP loss forces positive pairs close and random negative pairs far away, which has no stable solution when we map them onto points in a Euclidean space. Instead of representing an input as a deterministic point vector, we aim to map an input to a random variable.
As shown in Figure 1 (a), our probabilistic VLM (PrVLM) approach can handle this multiplicity, e.g., the distribution of "Train Station" covers all the train image distributions.

Figure 1: Comparison of ProLIP and deterministic embedding spaces ((a) ProLIP embedding space, (b) deterministic embedding space; probabilistic vs. deterministic image and text embeddings). We visualize images and captions from MS-COCO Caption (Chen et al., 2015) using models trained on DataComp 1B (Gadre et al., 2024) with 1.28B seen samples (see Appendix A.2 for details of the visualization method). ProLIP can capture the multiplicity of image-text matching (e.g., the text embedding of "Train station" covers all three train images), while deterministic embeddings fail to capture the ambiguity. Furthermore, when we synthetically remove the background, ProLIP maps the new embedding near the original embedding but with a larger uncertainty value (0.109 → 0.117), while the deterministic model maps the new embedding very far from the original embedding.

This paper introduces Probabilistic Language-Image Pre-training (ProLIP), the first PrVLM pre-trained on billion-scale image-text pairs using only probabilistic objectives. Compared to previous PrVLM works (Chun et al., 2021; Ji et al., 2023; Upadhyay et al., 2023; Chun, 2024), ProLIP has several advantages. First, while the previous methods need a dedicated module to predict uncertainty, ProLIP estimates uncertainty very efficiently simply by adding an uncertainty token ([UNC]) to the input, without other additional parameters. Second, we introduce a novel inclusion loss, which enforces the distributional inclusion relationship between an image-text pair and between the original input data and the masked one. Our new objective helps embeddings be more interpretable by humans. Third, ProLIP can be trained from scratch without needing any pre-trained models and achieves state-of-the-art zero-shot capability without fine-tuning. Furthermore, ProLIP achieves strong zero-shot capability, e.g., 74.6% ImageNet zero-shot accuracy with the ViT-B/16 backbone, where the CLIP model with the same number of seen samples achieves 73.5% (Ilharco et al., 2021).

In the experiments, ProLIP slightly outperforms the deterministic CLIP model in zero-shot classification (ZSC) tasks (e.g., CLIP shows 67.2 ImageNet ZSC accuracy, while ProLIP shows 67.6). We also show the benefits of using uncertainty estimates for image-text tasks. First, we observe that our intuition and the learned uncertainty align well. For example, (1) texts generally include images (i.e., texts are more uncertain than images), (2) shorter texts tend to be more uncertain, and (3) more general texts/images tend to be more uncertain and include more specific ones (e.g., the masked image and "Train Station" in Figure 1). Furthermore, we show two applications where a proper uncertainty estimate is helpful: Bayesian Prompt Re-Weighting (BPRW), a fully Bayesian approach to seek better ImageNet zero-shot prompts, which improves accuracy from 74.6% to 75.8%, and uncertainty-based dataset traversal, which provides a better understanding of dataset hierarchy.

2 PRELIMINARY

2.1 INHERENT AMBIGUITY INDUCED BY THE MULTIPLICITY OF IMAGE-TEXT PAIRS

The nature of image-text matching is many-to-many.
Unfortunately, in practice, this multiplicity is not fully annotated in VL datasets; we only treat one corresponding caption as the positive caption, while the others are considered negative. For example, in COCO Caption (Chen et al., 2015), more than 80% of positive correspondences are labeled as negative (Chun et al., 2022). As observed by Chun (2024), this hidden multiplicity inherently causes ambiguity in VL datasets. For example, assume we have the caption "a train is next to a train station" and three semantically similar images showing a train next to a train station. Here, there will be only one positive image for the caption due to the construction protocol of VL datasets. If we approximate the three image embeddings as the same image embedding, the correspondence between this approximated embedding and the caption will be uncertain (i.e., either positive or negative). Suppose we use a deterministic matching loss, such as the contrastive loss used by CLIP (Radford et al., 2021). In this case, the best deterministic mapping will map the caption embedding not very close to the image embeddings but properly far away from them. CLIP does not have enough capacity to capture multiplicity and ambiguity. We aim to achieve an embedding space that can represent the inherent uncertainty of the input (also known as aleatoric uncertainty) for a more interpretable and understandable embedding space. We include a more detailed discussion in Appendix A.1.

Figure 2: Overview of ProLIP. [CLS] and [UNC] tokens are used for µ and log σ², respectively. (The figure shows the visual and textual encoders producing L2-normalized µ and log σ², trained with the probabilistic pairwise contrastive loss (Eq. 2), the image-to-text inclusion loss (V ⊂ T, Eq. 5), and inclusion losses between the original and masked inputs (V ⊂ V_masked, T ⊂ T_masked) for an example caption "A grey cat wears a red hat".)

2.2 PROBABILISTIC IMAGE-TEXT REPRESENTATIONS

Probabilistic embeddings map each data point to a random variable (e.g., a Gaussian distribution) rather than a fixed vector, capturing the inherent uncertainty and diversity. This approach offers a better understanding of the semantic space by providing an extra axis of uncertainty, e.g., we can quantify the uncertainty of an input using the estimated uncertainty (e.g., the covariance of a Gaussian). Recently, Kirchhof et al. (2023) theoretically showed that probabilistic representation learning with a proper probabilistic matching loss can recover the correct aleatoric uncertainty. Namely, a probabilistic mapping can capture the ambiguity of the inputs. Probabilistic embeddings have been actively studied for applications with inherent ambiguity, such as word embeddings (Nguyen et al., 2017), image embeddings (Oh et al., 2019), face understanding (Shi & Jain, 2019; Chang et al., 2020), 2D-to-3D pose estimation (Sun et al., 2020), speaker diarization (Silnova et al., 2020), video understanding (Park et al., 2022), and composed image retrieval (Neculai et al., 2022). As we discuss in Section 2.1 and Appendix A.1, VL tasks also suffer from aleatoric uncertainty caused by the inherent multiplicity of image-text matching and sparse annotations. Recently, there have been attempts to tackle the inherent ambiguity of VL tasks with probabilistic embeddings (Chun et al., 2021; Ji et al., 2023; Upadhyay et al., 2023; Chun, 2024). However, these methods have been studied at a scale too limited to serve as a general-purpose VLM, such as CLIP.
For example, ProbVLM (Upadhyay et al., 2023) is an ad-hoc module on top of a frozen pre-trained CLIP, limiting the full exploration of the probabilistic space. Furthermore, ProbVLM is only trained on small image caption datasets, such as CUB (Wah et al., 2011) or COCO Caption (Chen et al., 2015), which makes it inapplicable to more practical zero-shot classification applications. MAP (Ji et al., 2023) proposes a pre-training method using a cross-attention Transformer. However, it has limited zero-shot capability, resulting in the need to fine-tune the model for each downstream task. Furthermore, its structure is highly inefficient for retrieval systems; it needs both the image and the text to compute a similarity between them, i.e., we have to process all possible image-text pairs to get the full similarity matrix. Lastly, PCME++ (Chun, 2024) showed the possibility of a pre-trained PrVLM, but its scalability is still limited (e.g., achieving 34% ImageNet zero-shot accuracy). We empirically observe that the objective function of PCME++ shows slow or unstable training under large-scale image-text pairs. Furthermore, all these PrVLMs need heavy additional parameters to estimate uncertainty from data. ProLIP does not need a dedicated module for uncertainty estimation but employs a very efficient strategy using [UNC].

3 PROBABILISTIC LANGUAGE-IMAGE PRE-TRAINING

3.1 ARCHITECTURE

We model an input as a Gaussian random variable with a diagonal covariance by estimating mean µ and variance σ² vectors from the input. Similar to CLIP (Radford et al., 2021), ProLIP has separate visual and textual encoders. We use Vision Transformer (ViT) (Dosovitskiy et al., 2021) for the visual encoder and a Transformer (Vaswani et al., 2017) for the textual encoder.

Previous probabilistic VLMs (PrVLMs) introduce additional parameters for estimating uncertainty. For example, PCME++ (Chun, 2024) uses one multi-head self-attention block for this. However, this approach requires additional parameters and computational costs, limiting usability (see Table C.6). Instead, we introduce a new uncertainty token [UNC], along with the class token [CLS] (see Figure 2). Compared to the previous PrVLMs, [UNC] requires almost negligible additional parameters. The visual encoder takes [CLS] and [UNC] at the beginning of the input sequence, while the textual encoder takes [UNC] and [CLS] at the end of the input. This is because the textual encoder of the original CLIP uses the end-of-sentence token rather than a [CLS] token at the beginning. Note that we assume a diagonal covariance for simplicity; namely, [UNC] has the same dimension as [CLS]. We use the L2-normalized [CLS] output as µ and the [UNC] output as log σ². Similar to [CLS], [UNC] is projected to the final embedding space using a linear layer. We initialize the bias value of this layer to a small value (e.g., −10) so that the initial σ² scale is small (e.g., exp(−10) ≈ 5 × 10⁻⁵ for each dimension). This simple trick helps stable training.
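To make the [UNC] mechanism concrete, the following is a minimal sketch (our illustration, not the official ProLIP code) of a head that turns the [CLS] and [UNC] token outputs into µ and log σ²; the module and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    """Maps the [CLS] / [UNC] token outputs of an encoder to (mu, log sigma^2)."""

    def __init__(self, width: int, embed_dim: int, init_logsigma2_bias: float = -10.0):
        super().__init__()
        self.cls_proj = nn.Linear(width, embed_dim)   # [CLS] -> mu
        self.unc_proj = nn.Linear(width, embed_dim)   # [UNC] -> log sigma^2
        # Initializing the [UNC] projection bias to -10 makes the initial variance
        # exp(-10) ~ 5e-5 per dimension, which the paper reports stabilizes training.
        nn.init.constant_(self.unc_proj.bias, init_logsigma2_bias)

    def forward(self, cls_token: torch.Tensor, unc_token: torch.Tensor):
        mu = F.normalize(self.cls_proj(cls_token), dim=-1)  # L2-normalized mean
        log_sigma2 = self.unc_proj(unc_token)               # diagonal log-variance
        return mu, log_sigma2
```

Because the covariance is diagonal, the [UNC] output has the same dimensionality as µ, so the only overhead per encoder is one extra input token and one linear projection.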
3.2 PROBABILISTIC PAIRWISE CONTRASTIVE LOSS

In this subsection, we introduce the probabilistic pairwise contrastive loss (PPCL), the main objective function of ProLIP. PPCL is similar to the probabilistic matching loss (PML) of PCME++, but we modify PML for stable training based on the log sigmoid loss of SigLIP (Zhai et al., 2023). Following PCME++, we use the closed-form sampled distance (CSD) as our probabilistic distance:

$$d_{\text{CSD}}(Z_1, Z_2) = \mathbb{E}_{Z_1, Z_2} \|Z_1 - Z_2\|_2^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{tr}(\Sigma_1 + \Sigma_2) = \|\mu_1 - \mu_2\|_2^2 + \|\sigma_1^2 + \sigma_2^2\|_1, \quad (1)$$

where $Z_1$ and $Z_2$ are Gaussian random variables with diagonal covariances. The probabilistic matching loss of PCME++ uses a pairwise binary cross entropy (BCE) by taking $-a \cdot d_{\text{CSD}}(Z_1, Z_2) + b$ as the logits, where $a$ and $b$ are learnable scalars. However, we empirically observe that PML quickly converges to a small value and its gradient becomes dramatically small, which makes the overall learning procedure slow or unstable (see Appendix A.3 for details). To solve the problem, we employ the log sigmoid loss (Zhai et al., 2023). By replacing the squared L2 distance $\|\mu_1 - \mu_2\|_2^2$ with Equation (1) (details are in Appendix A.4), we have a new probabilistic pairwise contrastive loss (PPCL):

$$\mathcal{L}_{\text{PPCL}}(Z_v, Z_t) = -\log \frac{1}{1 + \exp\!\left(y_{vt}\left(-a\left(\mu_v^\top \mu_t - \tfrac{1}{2}\mathrm{tr}(\Sigma_v + \Sigma_t)\right) + b\right)\right)}, \quad (2)$$

where $a$ and $b$ are learnable scalar values and $y_{vt}$ is 1 if $v$ and $t$ are matched and $-1$ otherwise.
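A minimal sketch of Equations (1) and (2) under our reading of the reconstructed signs; `a` and `b` are the learnable scalars of Eq. (2), and the batched shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def csd(mu1, log_sigma2_1, mu2, log_sigma2_2):
    """Closed-form sampled distance (Eq. 1) for diagonal Gaussians; (B, D) -> (B,)."""
    sq_l2 = ((mu1 - mu2) ** 2).sum(-1)
    trace = (log_sigma2_1.exp() + log_sigma2_2.exp()).sum(-1)
    return sq_l2 + trace

def ppcl_loss(mu_v, log_sigma2_v, mu_t, log_sigma2_t, a, b):
    """Probabilistic pairwise contrastive loss (Eq. 2) over all B x B pairs in a batch."""
    # Probabilistic similarity: mu_v^T mu_t - 0.5 * tr(Sigma_v + Sigma_t)
    sim = mu_v @ mu_t.t()
    half_tr = 0.5 * (log_sigma2_v.exp().sum(-1)[:, None]
                     + log_sigma2_t.exp().sum(-1)[None, :])
    inner = -a * (sim - half_tr) + b
    # y_vt = +1 for matched (diagonal) pairs and -1 otherwise.
    y = 2.0 * torch.eye(mu_v.size(0), device=mu_v.device) - 1.0
    # softplus(u) = -log sigmoid(-u), so this equals -log(1 / (1 + exp(y * inner))).
    return F.softplus(y * inner).mean()
```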
3.3 INCLUSION LOSS

Although PPCL enables learning probabilistic representations, we empirically observe that the uncertainty learned from data is often counterintuitive to humans. For example, we may expect that a text caption with a general meaning (e.g., "photo") has a very large covariance that can cover all the photographic image embeddings, but sometimes it does not. Similarly, we may expect that if a text or an image loses some information (e.g., some tokens are randomly masked), its probability distribution will entail the distribution of the original sample. However, it is not always guaranteed that a model will learn the desired uncertainty, especially under noisy image-text correspondences.

To tackle this issue, we introduce a novel objective function enforcing a random variable $Z_1$ to be included in another random variable $Z_2$. Let $p_1$ and $p_2$ be their corresponding probability density functions (pdfs). If $Z_1$ is included in $Z_2$, then we can presume that the area with high $p_1$ overlaps with the area with high $p_2$. From this observation, we propose a novel inclusion measure by emphasizing the area with high $p_1$ and computing the expectation of the emphasized $p_1$ under the distribution $p_2$. Specifically, we square $p_1$ and compute $\int p_1(x)^2 p_2(x)\,dx$. This measure is related to the Bhattacharyya coefficient ($\int \sqrt{p_1 p_2}\,dx$) or the inner product ($\int p_1 p_2\,dx$), but our measure is designed for measuring inclusion (it becomes high if $Z_1$ is included in $Z_2$ and low otherwise), while the others are designed for measuring the distance or dissimilarity between distributions. The log inclusion measure (omitting constants) can be derived as follows:

$$\mathrm{inc}(Z_1, Z_2) = \log \int p_1^2(x)\, p_2(x)\, dx = -\log \sigma_1^2 - \tfrac{1}{2}\log \sigma_2^2 - \tfrac{1}{2}\log(A) + \frac{B^2}{4A} - C,$$
$$\text{where } A = \frac{1}{\sigma_1^2} + \frac{1}{2\sigma_2^2}, \quad B = \frac{2\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}, \quad C = \frac{\mu_1^2}{\sigma_1^2} + \frac{\mu_2^2}{2\sigma_2^2}. \quad (3)$$

[Figure 3 panels: pdfs (and squared pdfs, dashed) of three Gaussian pairs, e.g., Z1 = N(0.01, 0.3) vs. Z2 = N(0.3, 0.5), Z1 vs. Z3 = N(1.1, 0.2), and Z3 vs. Z4 = N(−0.9, 0.2), each annotated with its H, inc, and KL values.]

Figure 3: Visual understanding of the proposed inclusion loss. We plot probability density functions (pdfs) of three pairs of Gaussian distributions and their inclusion hypothesis H(Z1 ⊂ Z2) (Equation (4)), log inclusion (Equation (3)), and KL divergence. The dashed line denotes the squared pdf, i.e., $p_1(x)^2$. H(Z1 ⊂ Z2) is (a) positive if Z1 is included in Z2 and (b) negative otherwise. (c) If Z1 and Z2 are at the same level, H becomes zero. While the log inclusion represents how much Z1 is included in Z2, KL measures the dissimilarity between distributions (e.g., (c) has the largest KL but the smallest inc). Figure A.2 shows more examples.

The full derivation is in Appendix A.5. Now, using Equation (3), we introduce a hypothesis test for whether $Z_1$ is included in $Z_2$ as follows:

$$H(Z_1 \subset Z_2) = \log \int p_1^2(x)\, p_2(x)\, dx - \log \int p_1(x)\, p_2^2(x)\, dx. \quad (4)$$

$H$ is positive if the hypothesis is true and negative otherwise (see Figure 3). It has two distinct properties compared to other probabilistic measures. First, $H$ is asymmetric. Most probabilistic measures aim to measure the distance, overlap, or dissimilarity between two distributions and are therefore symmetric (e.g., the Wasserstein distance and the Bhattacharyya distance). Meanwhile, we measure the level of inclusion between two random variables, which is inherently asymmetric: if $Z_1$ is included in $Z_2$, then $Z_2$ will not be included in $Z_1$, i.e., $H(Z_1 \subset Z_2) = -H(Z_2 \subset Z_1)$. Compared to an asymmetric measure such as the KL divergence, we aim to measure how much $Z_1$ is included in $Z_2$, whereas KL measures the dissimilarity between them based on relative entropy. As shown in Figure 3, even if $Z_1$ and $Z_2$ have the same variance, the KL can be very high while our measure becomes very small.

Similarly to PPCL, we use the log sigmoid loss for stable convergence. We use $\log \int p_1^2(x) p_2(x)\,dx - \log \int p_1(x) p_2^2(x)\,dx$ as the logit value, where the logit becomes positive if $Z_1$ is included in $Z_2$. Now, we introduce our novel inclusion loss as follows:

$$\mathcal{L}_{\text{inclusion}}(Z_1 \subset Z_2) = -\log \frac{1}{1 + \exp\left(-c\, H(Z_1 \subset Z_2)\right)}, \quad (5)$$

where $c$ is a positive scalar. For stability, we fix $c$ to a large value, such as 1000. Like the KL divergence, the inclusion loss can be volatile if variance values become extremely small. To prevent a loss explosion, during training we multiply $1/\sigma^2$ by a small $\varepsilon$ when computing $A$, $B$, and $C$ in Equation (3), mimicking each Gaussian having sufficiently large variances (multiplied by $1/\varepsilon$). See Table C.5 for more details.

Using the inclusion loss, we enforce two properties in the model. First, we let the text distribution include the image distribution, i.e., $\mathcal{L}_{\text{inclusion}}(Z_v \subset Z_t)$. This intuition comes from observations in previous studies showing that text entails image (Chun et al., 2021; Desai et al., 2023; Chun, 2024; Kim et al., 2024). Conceptually, when we describe an image, we select the product of the relevant concepts by an arbitrary choice; therefore, text usually carries more general information than images. As another property, we let the embedding of partial information include the embedding of its full information, i.e., $\mathcal{L}_{\text{inclusion}}(Z \subset Z_{\text{partial}})$. For example, we generate a text containing partial information of the original caption by masking out random tokens (Devlin et al., 2018). Similarly, we generate a partial image by masking out image tokens (He et al., 2022). In practice, we mask out 75% of the input tokens to generate partial information. For text, we replace the input tokens with [MASK], and for images, we drop the input patch tokens for efficient computation (Li et al., 2023).
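Below is a sketch of the inclusion measure and loss as we read Equations (3)–(5); the per-dimension summation for the diagonal-covariance case and the exact placement of the ε stabilization are our assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def log_inclusion(mu1, log_sigma2_1, mu2, log_sigma2_2, eps=1.0):
    """inc(Z1, Z2) = log int p1(x)^2 p2(x) dx (Eq. 3, constants omitted), summed over dims.

    A small eps (< 1) scales the precisions 1/sigma^2 used in A, B, C, mimicking
    larger variances to avoid loss explosion when variances are tiny (Section 3.3).
    """
    s1, s2 = log_sigma2_1.exp(), log_sigma2_2.exp()
    inv1, inv2 = eps / s1, eps / s2
    A = inv1 + 0.5 * inv2
    B = 2.0 * mu1 * inv1 + mu2 * inv2
    C = mu1 ** 2 * inv1 + 0.5 * mu2 ** 2 * inv2
    per_dim = (-torch.log(s1) - 0.5 * torch.log(s2)
               - 0.5 * torch.log(A) + B ** 2 / (4.0 * A) - C)
    return per_dim.sum(-1)

def inclusion_hypothesis(z1, z2):
    """H(Z1 in Z2) = inc(Z1, Z2) - inc(Z2, Z1) (Eq. 4); z* = (mu, log_sigma2)."""
    return log_inclusion(*z1, *z2) - log_inclusion(*z2, *z1)

def inclusion_loss(z1, z2, c=1000.0):
    """L_inclusion(Z1 in Z2) = -log sigmoid(c * H(Z1 in Z2)) (Eq. 5)."""
    return F.softplus(-c * inclusion_hypothesis(z1, z2)).mean()
```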
Finally, we use the VIB loss as a regularizer for each Gaussian embedding (i.e., preventing too small σ²), following Chun et al. (2021) and Chun (2024); see Appendix A.7 for the details. Putting Equation (2) and Equation (5) together, we have the following learning objective:

$$\mathcal{L} = \sum_{(v,t) \in (V,T)} \mathcal{L}_{\text{PPCL}}(Z_v, Z_t) + \sum_{Z_v \in V} \left[\alpha_2 \mathcal{L}_{\text{inclusion}}(Z_v \subset Z_{v_{\text{masked}}}) + \beta \mathcal{L}_{\text{VIB}}(Z_v)\right] + \sum_{Z_t \in T} \left[\alpha_2 \mathcal{L}_{\text{inclusion}}(Z_t \subset Z_{t_{\text{masked}}}) + \beta \mathcal{L}_{\text{VIB}}(Z_t)\right] + \sum_{(v,t) \in (V,T)} \alpha_1 \mathcal{L}_{\text{inclusion}}(Z_v \subset Z_t), \quad (6)$$

where $\alpha_1$, $\alpha_2$, and $\beta$ are control hyperparameters for each loss term. For computational efficiency, we generate masked samples only for 12.5% of the samples in the mini-batch and compute the inclusion loss using them. The VIB loss is computed for all samples. We report the loss ablation study in Appendix C.3.
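Putting the pieces together, here is a sketch of the overall objective in Equation (6), reusing the helpers above; `vib_loss` is a hypothetical stand-in for the VIB regularizer of Appendix A.7, and the 12.5% masked subset is assumed to be prepared by the data loader.

```python
def prolip_objective(img, txt, img_sub, img_masked, txt_sub, txt_masked,
                     a, b, alpha1, alpha2, beta):
    """Sketch of Eq. (6). Each img*/txt* argument is a (mu, log_sigma2) tuple;
    *_sub are the original embeddings of the 12.5% subset that was also masked."""
    loss = ppcl_loss(*img, *txt, a, b)                           # pairwise contrastive term
    loss = loss + alpha1 * inclusion_loss(img, txt)              # Z_v included in Z_t
    loss = loss + alpha2 * inclusion_loss(img_sub, img_masked)   # Z_v included in Z_v_masked
    loss = loss + alpha2 * inclusion_loss(txt_sub, txt_masked)   # Z_t included in Z_t_masked
    loss = loss + beta * (vib_loss(*img) + vib_loss(*txt))       # VIB regularizer (App. A.7)
    return loss
```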
3.4 PROMPT TUNING WITH UNCERTAINTY ESTIMATES

As observed by Chun (2024), the estimated uncertainty is not only beneficial for understanding the input data uncertainty but also effective for zero-shot classification (ZSC). In practice, we use multiple templated text prompts to estimate the textual embedding of class names. For example, the original CLIP paper uses 80 prompts, including "a photo of { }". Although this prompt engineering with a mixture of templates significantly improves ZSC performance, it is still unclear which template benefits ZSC. Furthermore, if we carefully explore the images of each class, we can conjecture that each class might need different templates. For example, in ImageNet, ferret images often co-occur with the ferret's owner. In this case, a prompt like "a photo of my ferret" can be helpful to estimate a text feature corresponding to the images, compared to using "an origami ferret".

How can we select the most informative text prompts? One possible solution is to filter out highly uncertain text prompts. For example, for the black-footed ferret class, "the embroidered { }", "the origami { }", or "the plastic { }" have high uncertainty values, while "a low resolution photo of { }" or "a cropped photo of { }" have small uncertainty values. Our experiment shows this strategy is only moderately effective: it improves ImageNet ZSC accuracy by +0.1pp. We presume this is because the variety of text prompt uncertainties for each class is not significantly large. Also, we presume that the suitability of a text prompt does not depend solely on the text itself; we may need to consider how well the texts describe the corresponding images.

We propose Bayesian Prompt Re-Weighting (BPRW), a simple probabilistic approach to find the optimal weight of prompts for each class. Let $\pi_c \in \mathbb{R}^N$ be the weight of each prompt, where $N$ is the number of prompts (i.e., 80 for ImageNet) and $c$ is the class index. Our goal is to find the best $\pi_c$ such that the new text embedding $Z_t^{\text{new}} = \sum_i \pi_c^i Z_t^i$ describes the given $M$ image embeddings $Z_v^j$. To achieve this goal, we optimize $\pi_c$ to have the best posterior for $Z_t$ and $Z_v$. First, we sample $K$ point vectors for each $Z_v^j$ and assume they are observations (in total, $K \cdot M$ point vectors). Next, we run a simple Expectation-Maximization (EM) algorithm to find the $\pi$ achieving the best log-likelihood. Here, we set a Dirichlet prior for $\pi_c$ using the uncertainty values of each $Z_t$, i.e., a prompt with higher uncertainty has a smaller prior. Due to the page limit, we describe the detailed algorithm in Appendix A.8.

Although our algorithm is theoretically well-founded and flexible thanks to its Bayesian formulation (e.g., setting the prior using $\sigma_t^2$), it needs the image embeddings corresponding to the target class, which violates the ZSC assumption. We tackle this issue by collecting corresponding image embeddings using KNN for each text class embedding. If we can use some true pairs under a few-shot setting (e.g., 5 true images for each class), we observe a significant performance improvement.

4 EXPERIMENTS

4.1 IMPLEMENTATION DETAILS AND EXPERIMENTAL PROTOCOL

Model. We use ViT-B/16 (Dosovitskiy et al., 2021) as our image encoder and a 12-layer, 768-wide Transformer (Vaswani et al., 2017) as our text encoder. We set the embedding dimension to 768 and the context length to 64 tokens, following SigLIP ViT-B/16 (Zhai et al., 2023).

Optimization. We implement ProLIP based on openclip (Ilharco et al., 2021) and the DataComp 1B dataset (Gadre et al., 2024). We list the optimization hyperparameters in Appendix B.1. We train ProLIP models using 32 NVIDIA H100 GPUs with bfloat16 precision, taking about one day to train a ViT-B/16 model with 1.28B seen samples. We initialize the bias value of the linear projection on top of [UNC] to −10 to initialize log σ² with a small value. We set the initial scale and bias parameters (a and b in Equation (2)) to 10 and 10, following Zhai et al. (2023). We randomly select 12.5% of the image-text pairs from the mini-batch and mask out 75% of their information, using [MASK] for texts (Devlin et al., 2018) and token dropping for images (He et al., 2022). Fine-tuning details are in Appendix B.2.

Evaluation. We evaluate the models on the 38 tasks of the DataComp evaluation suite (Gadre et al., 2024); the full evaluation datasets are listed in Appendix B.3. We report five categories: ImageNet, 6 ImageNet variants with distribution shifts, 13 VTAB tasks, 3 retrieval tasks, and the average of the 38 tasks. In addition, we employ the HierarCaps dataset (Alper & Averbuch-Elor, 2024), which provides captions with four different hierarchy levels (e.g., "water sports" → "kite surfing" → "kite surfer on top of the board" → "kite surfer in the air on top of a red board"). Similarly, we construct a new HierarImgs dataset, which provides images with four different hierarchy levels (see Figure B.2 for examples). We describe the details of HierarCaps, HierarImgs, and their evaluation in Section 4.4.

4.2 MAIN RESULTS

Table 1 shows the main results. We use multiple prompts for each task following the DataComp evaluation suite. Similar to CLIP zero-shot classification (ZSC), ProLIP uses an ensemble of multiple prompts: $Z_t^{\text{mixed}} = \mathcal{N}\!\left(\frac{1}{N}\sum_i \mu_i,\; \frac{1}{N}\sum_i \sigma_i^2\right)$, where $\mu_i$ and $\sigma_i^2$ denote the mean and variance of the $i$-th prompt and $N$ is the number of prompts (e.g., 80 for ImageNet). Note that if we treat this operation as the average of $N$ random variables, the variance should be $\frac{1}{N^2}\sum_i \sigma_i^2$, but we empirically observe that this division decreases the final ZSC performance, e.g., 74.51, whereas our approach shows 74.58 on ImageNet. We did not use the uncertainty-based ZSC described in Section 3.4 for evaluating the 38 tasks. Instead, we use CSD to find the nearest class text embedding. Table 1 shows that ProLIP outperforms CLIP in all metrics with 1.28B seen samples. Furthermore, when we train ProLIP with 12.8B seen samples, we achieve a high-performing PrVLM. We show more ablation studies in Appendix C.3, including loss design, hyperparameters, and architecture.
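A sketch of the prompt-ensembling step described above and of CSD-based zero-shot classification; the function names and shapes are our assumptions.

```python
import torch

def mix_prompts(mu_prompts, log_sigma2_prompts):
    """Ensemble N prompt embeddings of one class into a single Gaussian.

    Note the variance is the plain average (1/N) * sum_i sigma_i^2, not the
    (1/N^2)-scaled variance of an average of random variables (see Section 4.2).
    """
    mu_mixed = mu_prompts.mean(dim=0)                    # (N, D) -> (D,)
    sigma2_mixed = log_sigma2_prompts.exp().mean(dim=0)  # (N, D) -> (D,)
    return mu_mixed, sigma2_mixed

def zeroshot_logits(mu_img, sigma2_img, mu_cls, sigma2_cls):
    """Negative CSD between each image and each mixed class embedding (higher = closer)."""
    sq = ((mu_img[:, None, :] - mu_cls[None, :, :]) ** 2).sum(-1)   # (B, C)
    tr = sigma2_img.sum(-1)[:, None] + sigma2_cls.sum(-1)[None, :]  # (B, C)
    return -(sq + tr)   # predictions: logits.argmax(dim=-1)
```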
Table 1: Zero-shot classification results. The full results for each task are in Table C.1. ViT-L/16 and SO400M/14 results are fine-tuned from the pre-trained SigLIP models. More results are in Table C.2.

| Model | # Samples Seen | ImageNet | IN dist. shifts | VTAB | Retrieval | Average |
|---|---|---|---|---|---|---|
| CLIP | 1.28B | 67.2 | 55.1 | 56.9 | 53.4 | 57.1 |
| SigLIP | 1.28B | 67.4 | 55.4 | 55.7 | 53.4 | 56.7 |
| ProLIP | 1.28B | 67.8 | 55.3 | 58.5 | 53.0 | 57.9 |
| ProLIP | 12.8B | 74.6 | 63.0 | 63.7 | 59.6 | 63.3 |
| ViT-L/16 ProLIP | 1.28B* | 79.4 | 68.6 | 64.0 | 61.3 | 65.9 |
| ViT-SO400M/14 ProLIP | 1.28B* | 79.3 | 69.0 | 65.1 | 62.5 | 66.6 |

4.3 UNDERSTANDING THE LEARNED UNCERTAINTY

Uncertain samples visualization. From Equation (1), we can define the uncertainty of a given input by measuring $\mathrm{tr}(\Sigma)$, namely $\sum_i \sigma_i^2$ (we simply denote $\sigma_v^2$ for image uncertainty and $\sigma_t^2$ for text uncertainty). Figure 4 shows samples with low and high uncertainty values under this measure. We extract samples from the 3.5M subset of the 12.8B DataComp CommonCrawl small pool (Gadre et al., 2024), filtered by CLIP similarity and English filtering (https://huggingface.co/datasets/nielsr/datacomp-small-filtered). We use ProLIP with 12.8B seen samples for the analyses. Figure 4 shows that texts with more general meanings have high uncertainty, e.g., "Screenshot" or "graphic". This is because a shorter text with a more general meaning has more opportunities to be matched to various images. In contrast, certain captions describe a longer and more distinct context, such as an exact address or proper nouns, which is unlikely to match multiple images. We will show that the context length of the text is highly correlated with the uncertainty value (Figure 6). Interestingly, there are captions with high uncertainty despite long context lengths (e.g., longer than the specified context length). We empirically observe that such captions carry almost no information, showing that ProLIP captures text uncertainty well. Examples are shown in Appendix C.1.
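The scalar uncertainty used throughout this section is simply tr(Σ); a tiny sketch for ranking a pool of embeddings by this value (variable names are illustrative):

```python
def uncertainty(log_sigma2):
    """tr(Sigma) = sum_i sigma_i^2 for a diagonal Gaussian; (B, D) -> (B,)."""
    return log_sigma2.exp().sum(-1)

# e.g., most and least uncertain captions in a pool of text embeddings:
# scores = uncertainty(log_sigma2_texts)
# most_uncertain_idx = scores.topk(16).indices
# most_certain_idx = scores.topk(16, largest=False).indices
```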
[Figure 4 content: lists of example captions and images from the filtered DataComp small pool, grouped into bands from the most uncertain (above the 99.9th percentile of σ_t or σ_v) to the most certain (below the 0.1th percentile); the most uncertain captions are short, generic strings such as "Screenshot", "[graphic]", or "Security Image", while the most certain captions contain long, specific descriptions, addresses, or proper nouns.]

Figure 4: Uncertain & certain samples. Visualization from the 3.5M filtered DataComp small pool.

Figure 5: σ²_v vs. σ²_t (DataComp small). Generally, texts are more uncertain than images, as shown in Figure 4.

Figure 6: σ²_t vs. context length (CC3M). Shorter texts are generally more uncertain than complex ones.

Figure 7: σ²_t by text hierarchy (HierarCaps, Levels 0–3). More general captions (i.e., lower levels) are more uncertain.

We can see certain and uncertain images in the upper row of Figure 4. On the uncertain image side, we find images with a single object (e.g., a clock, a book, or a clipart) on a white background. Generally, such images can be matched to multiple possible descriptions, e.g., the name of the product, a detailed explanation of the product, or the text written in the image (e.g., the book title). On the other hand, certain images have more complex visual cues that can be described with more specific captions. More samples with high and low uncertainty can be found in Figure C.1.

Statistics of σ²_v and σ²_t. In Figure 4, we also observe that σ²_v is generally smaller than σ²_t, which likely originates from the inclusion loss L_inclusion(Z_v ⊂ Z_t) in Equation (6). Figure 5 shows that the image embeddings and text embeddings have distinct uncertainty values. In Appendix C.6, we provide a detailed discussion of the relationship between the learned uncertainty and human preference.

What is the source of the uncertainty? We answer this question by analyzing text context length and data hierarchy. First, we plot the text uncertainty against context length on the Conceptual Captions 3M (CC3M) captions (Sharma et al., 2018). As shown in Figure 6, a short caption tends to have a large uncertainty value. For example, we observe that the caption "film series" has the largest uncertainty value in CC3M, while the caption "gangsta rap artist told by person @ person I almost died you have to see this!" is the most certain caption. Namely, more uncertain captions tend to be logically more general captions, such as "dress - sewing pattern" or "person before & after", while more certain captions specify a particular situation.
From this observation, we explore the relationship between the uncertainty and varying levels of description. We employ the HierarCaps dataset, whose images have four levels of descriptions, derived from the full COCO Caption and its logical entailment hierarchy at three more general levels (e.g., "bird" → "blue bird" → ...). Examples are shown in Figure 9. Figure 7 shows the relationship between the text uncertainty and the text hierarchy level. Here, Level 0 denotes the most general captions, e.g., "chair" or "bird", and Level 3 represents the original COCO Caption. As shown in the figure, more general captions (lower levels) tend to be more uncertain, while more specific captions (higher levels) tend to be less uncertain.

[Figure 8 panels: histograms of the hypothesis value H between the original image and its segmentation-masked image at each level; 70.5%, 75.2%, and 76.7% of the samples satisfy the inclusion hypothesis for levels 0, 1, and 2, respectively.]

Figure 8: σ²_v by image hierarchy using HierarImgs. We test the inclusion hypothesis between the original image and the masked image from level i, H(orig ⊂ i), and its inverse hypothesis, H(i ⊂ orig). In all tests, more than 70% of the images are included in their lower-level images (histogram bars with positive hypothesis values). The dataset construction of HierarImgs and related discussions can be found in Appendix B.5 and C.5.

We further investigate whether a similar phenomenon happens for the visual modality by constructing a new HierarImgs dataset that represents a logical visual hierarchy. As shown in the upper row of Figure 8, our HierarImgs dataset consists of four levels: Level 3 is the original image and Level 0 is the largest visual segment. The details of the dataset construction can be found in Appendix B.5. Using these images, we analyze the relationship between the visual uncertainty and varying levels of visual information. We test whether each image becomes more uncertain than the original image by applying the inclusion hypothesis; namely, we test whether a lower-level image includes its original image using Equation (4). As shown in Figure 8, most of the images satisfy the inclusion hypothesis (e.g., more than 70%), implying that ProLIP also captures image hierarchy. In Appendix C.5, we explain more details of the image hierarchy experiments, including the absolute σ²_v value at different levels and possible pitfalls of HierarImgs (e.g., we need more careful filtering for a reliable evaluation).

4.4 APPLICATION USING UNCERTAINTY

Image traversals. Following Alper & Averbuch-Elor (2024), we first set the [ROOT] embedding. Then, we retrieve the nearest caption of the given image and interpolate between [ROOT] and the retrieved text embedding with 50 equally spaced steps. The null text embedding "" is used as the [ROOT] of the CLIP embedding space. In the ProLIP case, we can utilize the additional uncertainty information and the inclusion hypothesis. Hence, we search for the root embedding of a given image by finding the text embedding that includes the given image embedding the most. The other procedures are identical to Alper & Averbuch-Elor (2024). We perform traversals on HierarCaps. Details are in Appendix B.4.
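A sketch of the inclusion-based [ROOT] search and the traversal loop described above, reusing `log_inclusion` from Section 3.3; retrieving the nearest caption by mean inner product at each interpolation step is our simplification.

```python
import torch

def inclusion_root(img, captions):
    """Return the index of the caption whose distribution most includes the image.

    img = (mu, log_sigma2) for one image (shape (D,)); captions = (mu_caps,
    log_sigma2_caps) with shapes (N, D). The score is H(Z_v in Z_t) from Eq. (4).
    """
    mu_v, ls_v = img
    mu_t, ls_t = captions
    mu_v, ls_v = mu_v.expand_as(mu_t), ls_v.expand_as(ls_t)
    h = log_inclusion(mu_v, ls_v, mu_t, ls_t) - log_inclusion(mu_t, ls_t, mu_v, ls_v)
    return h.argmax().item()

def image_traversal(mu_root, mu_top1, mu_caps, steps=50):
    """Interpolate [ROOT] -> top-1 caption and retrieve the nearest caption per step."""
    retrieved = []
    for w in torch.linspace(0.0, 1.0, steps):
        query = (1.0 - w) * mu_root + w * mu_top1
        retrieved.append((mu_caps @ query).argmax().item())
    return retrieved
```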
We show image traversal results in Figure 9. Interestingly, the most inclusive caption (i.e., [ROOT]) for each image is not always the same as the ground-truth level 0 caption of HierarCaps. For example, our estimated [ROOT] for the vase picture is "vase", while the GT level 0 caption is "object". From this observation, we can presume that although our retrieval results are plausible, they could lead to inferior HierarCaps retrieval results. To ensure more diversity, we take the average of "" and the most inclusive text embedding and use it as the root embedding. Using the newly proposed root embedding, we quantitatively measure the performance of our traversal in Table 2. First, our new [ROOT] embedding is more specialized to the inputs than only using ""; we achieve a higher R@1[ROOT] using our approach. The table also shows that the proposed probabilistic image traversal achieves higher recall and precision than the deterministic traversal using "" as [ROOT]. Namely, the probabilistic approach gives more opportunities to obtain more precise captions during the traversal.

[Figure 9 content: for each query image (e.g., a kite surfer, a blue bird, an outdoor table, a red chair with a horse statue, a vase with purple flowers), the top-1 retrieved caption, the HierarCaps GT hierarchy, and the most inclusive caption estimated by ProLIP.]

Figure 9: Image traversals with ProLIP. For each image, we estimate the [ROOT] caption that includes the image the most using Equation (4). Then, we interpolate between the [ROOT] caption and the retrieved caption. We compare our interpolation and the HierarCaps GTs. Red denotes cases where the estimated and GT roots are different.

Table 2: HierarCaps retrieval. We measure the precision and recall of 50 traversal results. "Prob?" denotes inclusion-based traversal, using the average of "" and [ROOT] in Figure 9 as a new [ROOT]. R@1[ROOT] is the recall of [ROOT] embeddings.

| Model | Prob? | Prec | R@1 | R@1[ROOT] |
|---|---|---|---|---|
| CLIP (1.28B) | | 25.0 | 63.0 | 0.1 |
| ProLIP (1.28B) | | 28.4 | 62.6 | 0.1 |
| ProLIP (1.28B) | ✓ | 35.9 | 62.9 | 15.3 |
| ProLIP (12.8B) | | 31.7 | 67.8 | 0.1 |
| ProLIP (12.8B) | ✓ | 41.1 | 68.0 | 23.3 |

Uncertainty-based ImageNet prompt enhancement. As described in Section 3.4, we propose BPRW, a new prompt re-weighting method that finds a weight π_c for each class. A text embedding weighted by π_c, i.e., $Z_t^{\text{new}} = \sum_i \pi_c^i Z_t^i$, is used as the new class embedding for ZSC. Table 3 shows the ImageNet classification results with different strategies. First, solely using text uncertainty and filtering out uncertain texts is not sufficiently effective; it only improves top-1 accuracy by about +0.05pp.
Table 3: ImageNet prompt tuning by ProLIP. K denotes the number of few-shot samples for each class (if required). "OpenAI 80 prompts" is the same as the ImageNet entry of ProLIP 12.8B in Table 1.

| Prompt strategy | Accuracy | K |
|---|---|---|
| "a photo of { }" | 73.7 | - |
| OpenAI 80 prompts | 74.6 | - |
| Filtering by σ stats | 74.6 (+0.03) | - |
| Top-K prompts | 74.7 (+0.07) | - |
| BPRW (proposed) | 74.7 (+0.12) | 0 |
| BPRW (proposed) | 75.6 (+0.99) | 5 |
| BPRW (proposed) | 75.8 (+1.21) | 9 |

On the other hand, BPRW achieves better accuracy by using additional visual information; we gain +0.12pp for ImageNet ZSC. If we can use a few labeled images per class (e.g., five for each class), we obtain over 1.2% accuracy improvement by adjusting the prompt weights. In Appendix C.7, we describe more details of BPRW, including hyperparameters and more visualizations of the weights learned by BPRW. Interestingly, the learned weights follow the actual image distributions; if the images are mostly low resolution, a "low resolution" prompt becomes the most significant.

More examples and discussion. We include more applications in Appendix C.8, including filtering, dataset understanding, and LongProLIP. Appendix D discusses the diagonal covariance assumption.

5 CONCLUSION

In this work, we introduced Probabilistic Language-Image Pre-training (ProLIP), a fully probabilistic vision-language model that addresses the limitations of deterministic embeddings by capturing the inherent multiplicity in image-text relationships. By mapping inputs to random variables and efficiently estimating uncertainty through an uncertainty token ([UNC]), ProLIP models distributional relationships without additional parameters. The inclusion loss further enhances interpretability by enforcing distributional inclusion between image-text pairs and between original and masked inputs. Our experiments demonstrate that ProLIP is not only beneficial in zero-shot classification tasks but also provides an additional axis for understanding input data by capturing their uncertainty. Our approach highlights the potential of uncertainty modeling in vision-language applications.

AUTHOR CONTRIBUTIONS

This project is the extension of Chun et al. (2021; 2022); Chun (2024). S. Chun led the project; the other authors actively and significantly contributed to the project with advice and feedback. W. Kim and S. Yun contributed to the baseline openclip implementation and evaluation toolkits from Kim et al. (2024); S. Chun developed the main module upon the baseline implementation. The main ideas, such as [UNC], the probabilistic pairwise contrastive loss, and the inclusion loss, were designed and implemented by S. Chun. S. Park contributed to constructing the HierarImgs dataset. S. Chun wrote the initial version of the manuscript. S. Park contributed to better visualizations. All authors contributed to the final manuscript.

REPRODUCIBILITY STATEMENT

For reproducible research, we use an open-source training dataset (DataComp 1B (Gadre et al., 2024)) and codebase (openclip (Ilharco et al., 2021)). As clarified in Appendix B.3, we had 1,121,356,767 valid URLs among the 1.39 billion URLs and 1,118,443,492 unique images after de-duplicating URLs (there were 147,676,246 duplicated URLs in the DataComp 1B URLs). All the detailed hyperparameters are clarified in Appendix B.1.
Our results can be reproducible by our open-source implementation (https://github.com/naver-ai/ prolip) and released pre-trained weights in Hugging Face (https://huggingface.co/ collections/Sanghyuk Chun/prolip-6712595dfc87fd8597350291); the released weights include Pro LIP Vi T-B/16 (from-scratch), Vi T-L/16, Vi T-SO400M/14, Vi T-H/14 (finetuned), and Long Pro LIP Vi T-B/16 fine-tuned with different datasets. ACKNOWLEDGEMENTS We thank the internal infrastructure team who helped to download the Data Comp 1B dataset. Morris Alper and Hadar Averbuch-Elor. Emergent visual-semantic hierarchies in image-text representations. In European Conference on Computer Vision (ECCV), 2024. 7, 9, 22, 23 Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems (Neur IPS), volume 32, 2019. 21 Sara Beery, Arushi Agarwal, Elijah Cole, and Vighnesh Birodkar. The iwildcam 2021 competition dataset. ar Xiv preprint ar Xiv:2105.03494, 2021. 22 Lucas Beyer, Olivier J H enaff, Alexander Kolesnikov, Xiaohua Zhai, and A aron van den Oord. Are we done with Image Net? ar Xiv preprint ar Xiv:2006.07159, 2020. 30 Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, and Roy Schwartz. Winogavil: Gamified association benchmark to challenge vision-and-language models. Advances in Neural Information Processing Systems (Neur IPS), 35:26549 26564, 2022. 21 Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 mining discriminative components with random forests. In European Conference on Computer Vision (ECCV), pp. 446 461. Springer, 2014. 22 Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. Data uncertainty learning in face recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5710 5719, 2020. 3 Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale imagetext pre-training to recognize long-tail visual concepts. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3558 3568, 2021. 21, 23, 26 Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ar, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. ar Xiv preprint ar Xiv:1504.00325, 2015. 2, 3, 21 Published as a conference paper at ICLR 2025 Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865 1883, 2017. 21 Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818 2829, 2023. 21 Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6172 6180, 2018. 22 Sanghyuk Chun. Improved probabilistic image-text representations. In International Conference on Learning Representations (ICLR), 2024. 1, 2, 3, 4, 5, 6, 11, 16, 17, 19, 21, 32 Sanghyuk Chun and Sangdoo Yun. Long Pro LIP: A probabilistic vision-language model with long context text. ar Xiv preprint ar Xiv:2503.08048, 2025. 
32 Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. Probabilistic embeddings for cross-modal retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 1, 3, 5, 11, 19, 32 Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang Chang, and Seong Joon Oh. ECCV Caption: Correcting false negatives by collecting machine-and-human-verified image-caption associations for MS-COCO. In European Conference on Computer Vision (ECCV), 2022. 2, 11 M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 21 Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215 223. JMLR Workshop and Conference Proceedings, 2011. 22 Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Red Caps: Web-curated image-text data created by the people, for the people. In Neur IPS Dataset and Benchmark (Neur IPS D&B), 2021. 21, 23, 26 Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In International Conference on Machine Learning (ICML), pp. 7694 7731. PMLR, 2023. 5 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. 5, 7 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum? id=Yicb Fd NTTy. 3, 6 Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88:303 338, 2010. 22 Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. In International Conference on Learning Representations (ICLR), 2024. 21, 25 Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178 178. IEEE, 2004. 21 Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems (Neur IPS), 36, 2024. 2, 6, 7, 11, 21, 31 Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354 3361. IEEE, 2012. 21 Published as a conference paper at ICLR 2025 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000 16009, 2022. 
5, 7 Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Euro SAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217 2226, 2019. 21 Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-ofdistribution generalization. In International Conference on Computer Vision (ICCV), pp. 8340 8349, 2021a. 21 Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15262 15271, 2021b. 21 Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. Adam P: Slowing down the slowdown for momentum optimizers on scale-invariant weights. In International Conference on Learning Representations (ICLR), 2021. 21 Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773. 2, 6, 11, 21 Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, and Yujiu Yang. MAP: Multimodal uncertainty-aware vision-language pre-training model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23262 23271, 2023. 1, 3 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (ICML), pp. 4904 4916. PMLR, 2021. 1 Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2901 2910, 2017. 21 Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, and Sangdoo Yun. HYPE: Hyperbolic entailment filtering for underspecified images and texts. In European Conference on Computer Vision (ECCV), 2024. 5, 11 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. 21 Michael Kirchhof, Enkelejda Kasneci, and Seong Joon Oh. Probabilistic contrastive learning recovers the correct aleatoric uncertainty of ambiguous inputs. In International Conference on Machine Learning (ICML), 2023. 3 Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. 21 Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. 22 Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022. 1 Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23390 23400, 2023. 
5 Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), 2014. 22 Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=r Jz IBf ZAb. 31 Published as a conference paper at ICLR 2025 S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013. 22 Andrei Neculai, Yanbei Chen, and Zeynep Akata. Probabilistic compositional embeddings for multimodal image retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4547 4557, 2022. 3 Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, and Alessandro Bissacco. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, pp. 4. Granada, 2011. 21 Dat Quoc Nguyen, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. A mixture model for learning multisense word embeddings. In Proc. of the 6th Joint Conference on Lexical and Computational Semantics (* SEM 2017), pp. 121 127, 2017. 3 Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pp. 722 729. IEEE, 2008. 21 Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, and Andrew Gallagher. Modeling uncertainty with hedged instance embedding. In International Conference on Learning Representations (ICLR), 2019. 3, 19 Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn. Probabilistic representations for video contrastive learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14711 14721, 2022. 3 Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3498 3505. IEEE, 2012. 21 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pp. 8748 8763. PMLR, 2021. 1, 2, 3, 16, 22 Vikram V Ramaswamy, Sing Yu Lin, Dora Zhao, Aaron Adcock, Laurens van der Maaten, Deepti Ghadiyaram, and Olga Russakovsky. Geode: a geographically diverse evaluation dataset for object recognition. Advances in Neural Information Processing Systems (Neur IPS), 36, 2024. 22 Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do Image Net classifiers generalize to Image Net? In International Conference on Machine Learning (ICML), 2019. 21 William A Gaviria Rojas, Sudnya Diamos, Keertan Ranjan Kini, David Kanter, Vijay Janapa Reddi, and Cody Coleman. The dollar street dataset: Images representing the geographic and socioeconomic diversity of the world. In Advances in Neural Information Processing Systems (Neur IPS), 2022. 22 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Image Net large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems (Neur IPS), 35:25278-25294, 2022.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Association for Computational Linguistics (ACL), pp. 2556-2565, 2018.
Yichun Shi and Anil K Jain. Probabilistic face embeddings. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6902-6911, 2019.
Anna Silnova, Niko Brummer, Johan Rohdin, Themos Stafylakis, and Lukas Burget. Probabilistic embeddings for speaker diarization. In Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 24-31, 2020.
Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323-332, 2012.
Jennifer J Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, and Ting Liu. View-invariant probabilistic embedding for human pose. In European Conference on Computer Vision (ECCV), 2020.
Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. ProbVLM: Probabilistic adapter for frozen vision-language models. In International Conference on Computer Vision (ICCV), pp. 1899-1910, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (Neur IPS), pp. 5998-6008, 2017.
Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In Medical Image Computing and Computer Assisted Intervention (MICCAI 2018): 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II, pp. 210-218. Springer, 2018.
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems (Neur IPS), pp. 10506-10518, 2019.
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485-3492. IEEE, 2010.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Association for Computational Linguistics (ACL), 2:67-78, 2014.
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cut Mix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling Image Net: From single to multi-labels, from global to localized labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In International Conference on Computer Vision (ICCV), pp. 11975-11986, 2023.
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-CLIP: Unlocking the long-text capability of CLIP. In European Conference on Computer Vision (ECCV), pp. 310-325. Springer, 2024.
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.

A MORE DETAILS OF PROLIP

A.1 WHY DO WE NEED PROBABILISTIC REPRESENTATIONS?

Deterministic embeddings (e.g., CLIP (Radford et al., 2021), Sig LIP (Zhai et al., 2023)) suffer from the difficulty of representing input data uncertainty (i.e., aleatoric uncertainty). For example, "A person is walking" can be matched to either "A person is walking in the rain" or "A person is walking on sunshine", but in a deterministic embedding space (Figure A.1a), it is challenging to represent the ambiguity of "A person is walking". On the other hand, a probabilistic embedding space (Figure A.1b) can represent this ambiguity by expanding the area of the ambiguous embedding. If we assume a more ambiguous input, such as "person", a probabilistic embedding space can represent it by assigning a larger uncertainty value to "person". However, a deterministic embedding space will map an input to a specific vector coordinate; one possible choice is to map the uncertain input to the average of the possible matched inputs, e.g., locating the "person" embedding at the midpoint of all person-related text embeddings. However, this approach still cannot capture the actual semantic meaning of "person", which is more complex than the average embedding. This argument is empirically supported by the image traversal experiment in Section 4.4: using a more proper text embedding as the root embedding [ROOT] performs better than using a naive null text embedding. Furthermore, this paper targets the vision-language representation learning scenario, where the ambiguity of inputs is multi-modal; each modality (text and image) can have inherent ambiguity, as shown in Figure A.1, and the correspondence between image and text can have ambiguity due to the inherent many-to-many correspondence and abundant false negatives, as shown by Chun (2024). Overall, we need to use probabilistic representations rather than deterministic representations to represent the inherent ambiguity of vision-language datasets.

Figure A.1: Conceptual comparison between deterministic (a) and probabilistic (b) embedding spaces, illustrated with the captions "A person is walking", "A person is walking in the rain", "A person is walking on sunshine", and "A person hugging a cat". While a probabilistic representation space can naturally represent the inherent ambiguity of input data (i.e., aleatoric uncertainty) by estimating the uncertainty of each input, a deterministic embedding space can suffer from mapping the complex semantics of ambiguous inputs.
What property should be learned by an ideal Pr VLM? An ideal Pr VLM should capture three potential input uncertainties: (1) uncertainty from the text modality, (2) uncertainty from the image modality, and (3) uncertainty from the text-image cross-modality. Uni-modal uncertainty is straightforward: if an input has more detailed information (e.g., a text describing more detailed information, or a photograph capturing a very detailed and complex scene), then it will have smaller uncertainty; otherwise, it will have larger uncertainty (e.g., a very high-level caption, such as "person", or a photograph of only part of an object against a white background). The cross-modal uncertainty should capture "how many possible instances can be matched to this input?". We can think about this from two different viewpoints: text-to-image and image-to-text. The text-to-image relationship is straightforward. If we have a caption "photo", then it will be matched to all photographs, and if we have a caption with a very detailed description (e.g., the full description of a hotel room), then it will only be matched to specific images. Image-to-text relationships are often determined by the dataset. For example, if we have a caption dataset that exactly describes which objects are in the image, then we can expect that an image with fewer objects will have larger uncertainty. However, if we consider a caption dataset where an image has multiple captions, each focusing on completely different objects, then an image with more objects will have larger uncertainty. In other words, unlike text uncertainty originating from the text-to-image relationship, image uncertainty originating from the image-to-text relationship is highly affected by the properties of the training dataset. In practice, because our datasets have mixed properties and their captions are somewhat noisy, our image uncertainty will also have a mixed property; namely, unlike text uncertainty, there could be no strong relationship between the absolute uncertainty value and the number of objects (or the complexity of images). Empirically, we have more captions that exactly describe the scene; therefore, a simpler image tends to have larger uncertainty. Also, unlike texts, images always have the same number of pixels; the definition of information in images is not as straightforward as that of texts. Since there is no uni-modal uncertainty supervision, a Pr VLM trained only with image-text relationships has no guarantee of representing a proper image uni-modal uncertainty. To enforce a proper uni-modal image uncertainty, we propose uni-modal uncertainty supervision via the inclusion loss, namely, the original image embedding should be included by an image embedding from the masked image.

A.2 2D VISUALIZATION FOR FIGURE 1

We use a linear projection, such as Principal Component Analysis (PCA), for the visualization. If we use a non-linear projection from a high-dimensional space to a 2-dimensional space, there is no guarantee that the projected Gaussian distribution still follows a Gaussian distribution. Therefore, we project the high-dimensional embeddings to a 2-dimensional space using PCA. All models are trained on the Data Comp 1B dataset with 1.28B seen samples (i.e., the 1.28B models in Table 1). For both models, we select ten similar images and their corresponding captions from the Hierar Caps dataset. Then, for Pro LIP, we randomly sample ten embeddings for each embedding using the learned Gaussian distributions. We then apply PCA to the sampled embeddings for visualization.
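To make the visualization step concrete, below is a minimal sketch of the procedure, assuming hypothetical arrays `mu` and `var` that hold the Gaussian means and diagonal variances extracted from a trained model; the shapes and sample counts mirror the description above but are otherwise illustrative, not the official implementation.

```python
# Minimal sketch: sample from diagonal Gaussian embeddings and project them with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 20, 768                                  # e.g., ten images + ten captions, width 768
mu = rng.normal(size=(n, d))                    # placeholder Gaussian means
var = rng.uniform(0.01, 0.1, size=(n, d))       # placeholder diagonal variances

# Draw ten samples per embedding from N(mu_i, diag(var_i)).
samples = mu[:, None, :] + np.sqrt(var)[:, None, :] * rng.normal(size=(n, 10, d))

# Fit a linear projection (PCA); a linear map keeps projected Gaussians Gaussian.
pca = PCA(n_components=2).fit(samples.reshape(-1, d))
mu_2d = pca.transform(mu)                                       # projected means
samples_2d = pca.transform(samples.reshape(-1, d)).reshape(n, 10, 2)  # projected samples
```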
A.3 PROBABILISTIC MATCHING LOSS VS. PROBABILISTIC PAIRWISE CONTRASTIVE LOSS

The probabilistic matching loss (PML) of PCME++ (Chun, 2024) and the proposed probabilistic pairwise contrastive loss (PPCL) use almost the same logit based on CSD (Equation (1)). However, they differ in two ways: (1) PML is based on the original CSD, which computes the L2 distance between the µs, and (2) PML uses a binary cross-entropy (BCE) loss while PPCL uses a log-sigmoid loss. The first difference can cause numerical errors when the difference between the µs is extremely small, because the L2 distance requires a squaring operation, i.e., $\sum_{i=1}^{D} (a_i - b_i)^2$, where $a_i$ and $b_i$ are the scalar values of the $i$-th dimension of vectors $a$ and $b$. On the other hand, we use a logit based on the matrix multiplication $\mu_1^\top \mu_2$ (derived in Appendix A.4), which shows a more stable and accurate computation in terms of float precision. The second difference can cause fast gradient vanishing, as already discussed by Chun (2024). Chun (2024) thus employed additional techniques to mitigate the issue, e.g., pseudo positives and mixed-sample data augmentation methods, such as Mixup (Zhang et al., 2018) and Cut Mix (Yun et al., 2019). However, in this paper, we omit these techniques because simplicity is important when training with large-scale training samples. To tackle the issue, we instead employ a log-sigmoid loss and a multiplication-based logit, resulting in fast and stable convergence. In our ablation study (Appendix C.2), we empirically show that the PML loss used by PCME++ fails to converge when it is used in a stand-alone way without any deterministic loss (e.g., the CLIP loss). On the other hand, the PPCL loss converges well even without any additional loss function. Note that PCME++ and Pro LIP use the normalized mean for both training and inference to compute cosine similarity, the same as CLIP. Also, PCME++ and Pro LIP can sample a Gaussian random variable by the reparameterization trick, i.e., $Z \sim \mathcal{N}(\mu, \Sigma)$ is sampled as $\mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\odot$ denotes the element-wise multiplication. In practice, because we use the closed-form solutions for the distance (CSD, Equation (1)) and the loss functions (inclusion loss, PPCL), no sampling based on the reparameterization trick is required.

A.4 DERIVATION OF EQUATION (2)

We first start from the fact that $d_{L2} = \|\mu_1 - \mu_2\|_2^2 = 2 - 2\mu_1^\top \mu_2 = 2 - 2\,d_{L2cos}$ when $\mu_1$ and $\mu_2$ are L2-normalized (i.e., $\|\mu\|_2^2 = 1$). Namely, the cosine similarity is $d_{L2cos} = 1 - \frac{1}{2} d_{L2}$. By replacing $d_{L2}$ with $d_{CSD}$ (Equation (1)), we can conclude the derivation:
$$d_{CSDcos} = 1 - \frac{1}{2} d_{CSD} = 1 - \frac{1}{2}\|\mu_v - \mu_t\|_2^2 - \frac{1}{2}\mathrm{tr}(\Sigma_v + \Sigma_t) = d_{L2cos} - \frac{1}{2}\mathrm{tr}(\Sigma_v + \Sigma_t). \tag{A.1}$$
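The relation above is easy to sanity-check numerically. The following plain-NumPy sketch (all variable names are illustrative and not from the official implementation) computes CSD and the CSD-based cosine similarity for a pair of diagonal Gaussian embeddings and verifies that $d_{CSDcos} = 1 - \frac{1}{2} d_{CSD}$ when the means are L2-normalized.

```python
# Minimal sketch of CSD (Eq. (1)) and the CSD-based cosine similarity (Eq. (A.1)).
import numpy as np

def csd(mu1, var1, mu2, var2):
    """||mu1 - mu2||^2 + tr(Sigma1 + Sigma2) for diagonal covariances."""
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1) + np.sum(var2)

def csd_cos(mu1, var1, mu2, var2):
    """mu1^T mu2 - 0.5 * tr(Sigma1 + Sigma2), assuming L2-normalized means."""
    return mu1 @ mu2 - 0.5 * (np.sum(var1) + np.sum(var2))

rng = np.random.default_rng(0)
mu_v = rng.normal(size=128); mu_v /= np.linalg.norm(mu_v)
mu_t = rng.normal(size=128); mu_t /= np.linalg.norm(mu_t)
var_v = rng.uniform(0.001, 0.01, 128)
var_t = rng.uniform(0.001, 0.01, 128)

# Check the affine relation d_CSDcos = 1 - 0.5 * d_CSD for L2-normalized means.
assert np.isclose(csd_cos(mu_v, var_v, mu_t, var_t),
                  1.0 - 0.5 * csd(mu_v, var_v, mu_t, var_t))
```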
A.5 DERIVATION OF EQUATION (3)

Let us assume two Gaussian distributions:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right), \qquad q(x) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right).$$
First, we take the square of $p(x)$:
$$p(x)^2 = \frac{1}{2\pi\sigma_1^2} \exp\left(-\frac{(x-\mu_1)^2}{\sigma_1^2}\right).$$
Now, we compute the integral:
$$\int p(x)^2 q(x)\, dx = \int \frac{1}{2\pi\sigma_1^2} \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{(x-\mu_2)^2}{2\sigma_2^2}\right) dx.$$
We expand the terms in the exponent as follows:
$$-\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{(x-\mu_2)^2}{2\sigma_2^2} = -\left(A x^2 - B x + C\right), \quad A = \frac{1}{\sigma_1^2} + \frac{1}{2\sigma_2^2}, \quad B = \frac{2\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}, \quad C = \frac{\mu_1^2}{\sigma_1^2} + \frac{\mu_2^2}{2\sigma_2^2}.$$
The term in the exponent simplifies to:
$$-A\left(x - \frac{B}{2A}\right)^2 + \frac{B^2}{4A} - C.$$
Now, the integral becomes:
$$\frac{1}{2\pi\sigma_1^2 \sqrt{2\pi\sigma_2^2}} \exp\left(\frac{B^2}{4A} - C\right) \int \exp\left(-A\left(x - \frac{B}{2A}\right)^2\right) dx.$$
Using the Gaussian integral formula, we have:
$$\int \exp\left(-A\left(x - \frac{B}{2A}\right)^2\right) dx = \sqrt{\frac{\pi}{A}}.$$
Thus, the integral becomes:
$$\int p(x)^2 q(x)\, dx = \frac{1}{2\pi\sigma_1^2 \sqrt{2\pi\sigma_2^2}} \sqrt{\frac{\pi}{A}} \exp\left(\frac{B^2}{4A} - C\right).$$
By taking the logarithm and omitting constant terms, we have Equation (3).

Figure A.2: More visual examples of the inclusion loss, computed for $Z_1 = \mathcal{N}(-1, 0.05)$ and $Z_2 = \mathcal{N}(1, 0.3)$ (inc($Z_1$, $Z_2$) = -10.530 and inc($Z_2$, $Z_1$) = -33.974, while KL($Z_1$, $Z_2$) = 40.889 and KL($Z_2$, $Z_1$) = 1639.584). The difference of KL and reverse KL ($\Delta$KL) cannot correctly capture the inclusion relationship.

A.6 MORE DISCUSSIONS RELATED TO INCLUSION LOSS

Figure A.2 shows more comparisons between the proposed inclusion loss and the KL divergence. Specifically, we compare the inclusion loss with the difference of KL and reverse KL ($\Delta$KL), namely, $\Delta KL(Z_1, Z_2) = KL(Z_1, Z_2) - KL(Z_2, Z_1)$. As shown in the figure, the most significant problem with $\Delta$KL is that it cannot correctly represent the inclusion relationship. For example, consider two random variables $Z_1$ and $Z_2$ that do not include each other, as shown in Figure A.2. In this case, although $Z_1$ and $Z_2$ do not include each other, because $\Delta KL(Z_1, Z_2) = -\Delta KL(Z_2, Z_1)$, one of $\Delta KL(Z_1, Z_2)$ or $\Delta KL(Z_2, Z_1)$ will become positive, while the other will become negative. Namely, $\Delta$KL cannot correctly capture the inclusion relationship. On the other hand, in the same scenario, our inclusion test always returns negative values, as shown in the figure, which correctly represents the inclusion relationship.

A.7 VIB LOSS

The VIB loss is a simple regularization term to prevent the collapse of the estimated variance and is widely used by previous probabilistic representation learning methods (Oh et al., 2019; Chun et al., 2021; Chun, 2024). The VIB loss formulation is as follows:
$$\mathcal{L}_{VIB}(Z) = KL\left(Z \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2}\left(1 + \log \sigma^2 - \mu^2 - \sigma^2\right). \tag{A.2}$$
Please refer to Oh et al. (2019) for the full derivation.
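As a concrete illustration of the 1-D quantity derived in Appendix A.5, and of the asymmetry that Appendix A.6 relies on, the following sketch evaluates $\log \int p(x)^2 q(x)\, dx$ in closed form and checks it against numerical integration. Equation (3) keeps this quantity only up to constant terms, so the sketch is illustrative rather than a reproduction of the exact inc values reported in Figure A.2; all names are placeholders.

```python
# Minimal 1-D sketch: closed-form log \int p(x)^2 q(x) dx for two Gaussians,
# verified against numerical integration, and illustrating the asymmetry of the measure.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def log_int_p2q(mu1, s1, mu2, s2):
    A = 1.0 / s1**2 + 1.0 / (2.0 * s2**2)
    B = 2.0 * mu1 / s1**2 + mu2 / s2**2
    C = mu1**2 / s1**2 + mu2**2 / (2.0 * s2**2)
    log_const = -np.log(2.0 * np.pi * s1**2) - 0.5 * np.log(2.0 * np.pi * s2**2)
    return log_const + 0.5 * np.log(np.pi / A) + B**2 / (4.0 * A) - C

mu1, s1 = -1.0, np.sqrt(0.05)   # Z1 in Figure A.2 (variance 0.05)
mu2, s2 = 1.0, np.sqrt(0.3)     # Z2 in Figure A.2 (variance 0.3)

closed = log_int_p2q(mu1, s1, mu2, s2)
numeric = np.log(quad(lambda x: norm.pdf(x, mu1, s1) ** 2 * norm.pdf(x, mu2, s2), -20, 20)[0])
assert np.isclose(closed, numeric, atol=1e-6)

# Swapping the roles of p and q changes the value, which is the directional behavior
# the inclusion test exploits, unlike the symmetric KL difference.
print(closed, log_int_p2q(mu2, s2, mu1, s1))
```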
A.8 ALGORITHM OF UNCERTAINTY-AWARE ZERO-SHOT PROMPT SELECTION

Remark: Let $\pi_c \in \mathbb{R}^N$ be the weight of each prompt, where $N$ is the number of prompts (i.e., 80 for Image Net) and $c$ is the class index. Our goal is to find the best $\pi_c$ such that the new text embedding $Z_t^{new} = \sum_i \pi_c^i Z_t^i$ describes the given $M$ image embeddings $Z_v^j$. To achieve this goal, we optimize $\pi_c$ to have the best posterior for $Z_t$ and $Z_v$. For each class $c$, we sample $K$ points from each $Z_v^j$; namely, we have in total $M' = M \times K$ point embeddings as observations for class $c$. We assume a Dirichlet prior for $\pi_c$ with parameter $\alpha$, where $\alpha$ controls the uniformity of the posterior. If we choose a large $\alpha$, such as 10, then $\pi$ will be more uniform, and if we choose a small $\alpha$, such as 0.1, then $\pi$ almost becomes a Dirac delta distribution. We also set the initial $\pi$ with the reversed uncertainty score, i.e., the normalized $1/\mathrm{tr}(\Sigma_t)$ for each prompt. We also employ a small trick for stability: after we get the prior value, we add a small $\varepsilon$ to $\Sigma_t$ to make the overall operation stable.

Now, we explain the Expectation-Maximization (EM) algorithm for estimating the mixing proportions of a mixture of $N$ Gaussian components, incorporating a Dirichlet prior over the mixing proportions. For simplicity, we omit the class index $c$ for the remainder of this section. We place a Dirichlet prior on the mixing proportions for the given $\alpha$:
$$p(\pi) = \mathrm{Dirichlet}(\pi \mid \alpha). \tag{A.3}$$
We have a dataset of $M'$ observations $\{x_j\}_{j=1}^{M'}$ and assume that the data is generated from a mixture of $N$ Gaussian distributions (i.e., the number of prompts).

E-Step: Compute the responsibilities $\gamma_{jn}$, which represent the probability that observation $x_j$ was generated by component $n$:
$$\gamma_{jn} = \frac{\pi_n f_n(x_j)}{\sum_{i=1}^{N} \pi_i f_i(x_j)},$$
where $f_n(x_j)$ is the Gaussian probability density function of component $n$ evaluated at $x_j$:
$$f_n(x_j) = \frac{1}{(2\pi)^{D/2} |\Sigma_n|^{1/2}} \exp\left(-\frac{1}{2}(x_j - \mu_n)^\top \Sigma_n^{-1} (x_j - \mu_n)\right).$$
Since $\Sigma_n$ is diagonal, the computation simplifies to:
$$f_n(x_j) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{nd}^2}} \exp\left(-\frac{(x_{jd} - \mu_{nd})^2}{2\sigma_{nd}^2}\right).$$

M-Step: Compute the effective number of observations assigned to each component:
$$N_n = \sum_{j=1}^{M'} \gamma_{jn}.$$
Then, update the mixing proportion for each prompt $n$:
$$\pi_n = \frac{N_n + \alpha_n - 1}{M' + \sum_{i=1}^{N}(\alpha_i - 1)} = \frac{N_n + \alpha - 1}{M' + N(\alpha - 1)}.$$
Ensure that $\pi_n \geq 0$ and $\sum_{n=1}^{N} \pi_n = 1$. We can simplify the algorithm as follows:

Algorithm 1 Bayesian Prompt Re-Weighting (BPRW)
1: Initialization ($\alpha$: the Dirichlet prior hyperparameter, $\varepsilon$: the stability hyperparameter)
2: for each class c = 1 to C do
3:   Collect $M'$ observations by choosing $M$ image embeddings $Z_v$ and sampling $K$ points from each $Z_v$. We can choose $M$ ground-truth observations where $Z_v$ is the embedding of an image containing class $c$; otherwise, we select the $M$ nearest samples from the base text prompt embedding of $c$.
4:   for each prompt index n = 1 to N do
5:     Set the initial mixing proportion: $\pi_n \leftarrow \frac{1/\mathrm{tr}(\Sigma_n)}{\sum_{i=1}^{N} 1/\mathrm{tr}(\Sigma_i)}$.
6:     Modify the covariance matrix for stability: $\Sigma_n \leftarrow \Sigma_n + \varepsilon I$.
7:   end for
8:   repeat
9:     // E-Step:
10:    for each observation j = 1 to M' do
11:      for each prompt index n = 1 to N do
12:        Compute the responsibility $\gamma_{jn} = \pi_n f_n(x_j) / \sum_{i=1}^{N} \pi_i f_i(x_j)$.
13:      end for
14:    end for
15:    // M-Step:
16:    for each prompt index n = 1 to N do
17:      Compute $N_n = \sum_{j=1}^{M'} \gamma_{jn}$.
18:      Update the mixing proportion $\pi_n = \frac{N_n + \alpha - 1}{M' + N(\alpha - 1)}$.
19:    end for
20:  until convergence
21: end for
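A compact NumPy sketch of Algorithm 1 for a single class is given below. The prompt means and variances, the observation matrix, and the fixed iteration count are illustrative placeholders; the official implementation may differ, e.g., in its convergence test.

```python
# Minimal sketch of the BPRW EM loop for one class, with diagonal Gaussian prompts.
import numpy as np

def bprw(mu, var, x, alpha=5.0, eps=0.02, n_iter=50):
    """mu, var: (N, D) prompt means/variances; x: (M', D) sampled point observations."""
    N, D = mu.shape
    Mp = x.shape[0]
    var = var + eps                                   # stability trick: inflate variances
    pi = 1.0 / var.sum(axis=1)
    pi = pi / pi.sum()                                # init with reversed uncertainty score
    for _ in range(n_iter):
        # E-step: responsibilities under diagonal Gaussians, computed in log-space.
        log_f = -0.5 * (np.log(2 * np.pi * var)[None]
                        + (x[:, None, :] - mu[None]) ** 2 / var[None]).sum(-1)
        log_r = np.log(pi)[None] + log_f
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r = r / r.sum(axis=1, keepdims=True)          # gamma_{jn}, shape (M', N)
        # M-step: MAP update of mixing proportions under a symmetric Dirichlet prior.
        Nn = r.sum(axis=0)
        pi = (Nn + alpha - 1.0) / (Mp + N * (alpha - 1.0))
        pi = np.clip(pi, 0.0, None)
        pi = pi / pi.sum()
    return pi
```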
B MORE EXPERIMENTAL DETAILS

B.1 HYPERPARAMETERS

We use the Adam W (Kingma & Ba, 2015) optimizer, following the official openclip implementation. We also tried Adam P (Heo et al., 2021), which was the main optimizer of PCME++ (Chun, 2024). We empirically observe that Adam P can improve the overall performance (e.g., average performance from 57.3 to 57.8 under the same setting), but for a fair comparison, we follow the protocol of openclip (Ilharco et al., 2021). We use a learning rate of 0.0005, beta1 of 0.9, beta2 of 0.95, weight decay of 0.2, and a batch size of 512 for each GPU (i.e., the full batch size is 512 × 32 = 16384). We apply 10,000 warmup steps, and then the learning rate is decayed by cosine learning rate scheduling. We use image augmentations of scaling from 0.8 to 1.0, color jittering, and grayscale. Within each mini-batch, we select 12.5% of the image-text pairs and drop 75% of their tokens to compute Linc(x → x_masked). In all experiments, we fix α1 and α2 in Equation (6) as 1e-7 and 0.001, respectively. We tried various combinations of α, but we found that the final result is relatively robust to the choice of α unless we choose a very large α. Similarly, we tested various ε and c for the inclusion loss (Equation (5)), but as shown in Table C.5, the final result is relatively robust to the choice of these parameters. If not specified, we chose ε = -20 and c = 1000, but for future research, we recommend using ε = -100 and c = 10, which empirically performs well and enables stable large-scale training.

B.2 FINE-TUNING DETAILS

During the project, we indeed aspired to train larger models, such as Vi T-H, Vi T-SO400M, Vi T-G, or Vi T-g. However, as reported by Cherti et al. (2023), training a Vi T-H/14 CLIP model with 34B seen samples (achieving 78.0% Image Net zero-shot accuracy) took 279 hours with 824 A100s. Using our resources (32 H100s), training the same backbone from scratch would take almost 300 days. For this reason, we tried to fine-tune strong pre-trained backbones. Note that our goal is not to fine-tune the existing deterministic VLMs, but we report four fine-tuning results, Vi T-B/16 (76.0% IN-ZS), Vi T-L/16 (80.5% IN-ZS), and Vi T-SO400M/14 (82.0% IN-ZS) pre-trained by Sig LIP (Zhai et al., 2023), and Vi T-H/14 (83.4% IN-ZS) pre-trained by DFN (Fang et al., 2024), for achieving a stronger Pr VLM. For architectural consistency between all Pro LIP models, we remove the attention pooling of the pre-trained Sig LIP, fine-tune the models using [CLS] token-based pooling, and fix the image resolution to 224 × 224. This leads to slightly worse performance of the fine-tuned models than the original models. Note that we can implement both attention pooling and the [UNC] token architecture simultaneously, but we did not implement this architecture for simplicity. We again emphasize that we do not aim to achieve state-of-the-art performance; instead, our goal is to verify whether Pro LIP performs better when scaling up the architecture beyond Vi T-B/16. For the fine-tuning, we use a learning rate of 5.0e-5, a weight decay of 0.0, and 1.28B seen samples.

B.3 TRAINING AND EVALUATION DATASETS

We mainly use the Data Comp 1B dataset (Gadre et al., 2024), a filtered version of the LAION-5B dataset (Schuhmann et al., 2022), as our training dataset. We had 1,121,356,767 valid URLs among 1.39 billion URLs and 1,118,443,492 unique images after de-duplicating URLs (there were 147,676,246 duplicated URLs in the Data Comp 1B URLs). For the ablation study, we use the Conceptual Caption 3M (Sharma et al., 2018), Conceptual Caption 12M (Changpinyo et al., 2021), and Red Caps (Desai et al., 2021) datasets. We use 38 tasks from the Data Comp evaluation suite: Image Net (Russakovsky et al., 2015); 6 benchmarks for evaluating robustness under Image Net distribution shifts, including Image Net-A, Image Net-O (Hendrycks et al., 2021b), Image Net-R (Hendrycks et al., 2021a), Image Net v2 (Recht et al., 2019), Image Net-Sketch (Wang et al., 2019), and Object Net (Barbu et al., 2019); 13 VTAB tasks (Zhai et al., 2019), including Caltech-101 (Fei-Fei et al., 2004), CIFAR-100 (Krizhevsky, 2009), CLEVR Counts, CLEVR Distance (Johnson et al., 2017), Describable Textures (Cimpoi et al., 2014), Euro SAT (Helber et al., 2019), KITTI Vehicle Distance (Geiger et al., 2012), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT Pet (Parkhi et al., 2012), Patch Camelyon (Veeling et al., 2018), RESISC45 (Cheng et al., 2017), SVHN (Netzer et al., 2011), and SUN397 (Xiao et al., 2010); and 3 retrieval tasks, including Flickr (Young et al., 2014), MS-COCO Caption (Chen et al., 2015), and Wino GAVi L (Bitton et al., 2022).
There are also 13 additional tasks, such as CIFAR-10, Country211 (Radford et al., 2021), FGVC Aircraft (Maji et al., 2013), Food-101 (Bossard et al., 2014), GTSRB (Stallkamp et al., 2012), MNIST (Le Cun et al., 1998), Pascal VOC (Everingham et al., 2010), Rendered SST2 (Radford et al., 2021), STL-10 (Coates et al., 2011), i Wild Cam (Beery et al., 2021), FMo W (Christie et al., 2018), Dollar Street (Rojas et al., 2022), and Geo DE (Ramaswamy et al., 2024). For zero-shot classification, we use CSD as the distance between the given image and the pre-extracted text template features representing each class. As CSD uses uncertainty (Equation (1)), this protocol uses the uncertainty value for zero-shot classification.

B.4 HIERARCAPS AND IMAGE TRAVERSAL DETAILS

The Hierar Caps dataset (Alper & Averbuch-Elor, 2024) is a human-validated caption dataset where each image has four levels of captions, for example, "water sports" → "kite surfing" → "kite surfer on top of the board" → "kite surfer in the air on top of a red board". These captions are human-validated; namely, the levels of the Hierar Caps dataset are aligned with the hierarchical perception of humans. As shown in Figure 7, the human-validated level of Hierar Caps (i.e., human intuitions) aligns well with the uncertainty learned by Pro LIP. We also showed that the learned uncertainty can improve the image traversal task. The traversal task needs two pieces of information: the closest text embedding and the [ROOT] embedding for the given image. Once the closest text embedding and the [ROOT] embedding are chosen, we interpolate them with 50 equally spaced steps and find the closest text from the database for each interpolated embedding. Uncertainty information is used for the traversal task from two perspectives. First, we use CSD to retrieve captions, where CSD uses the uncertainty (Equation (1)). More importantly, we estimate the [ROOT] embedding using uncertainty. Previous approaches use the average text embedding or the null text embedding as the [ROOT] embedding. Instead of using the average or null text embedding, we propose to use an uncertainty-based [ROOT] embedding. As clarified in Section 4.4, we first retrieve the most similar caption of the given image (following the common protocol (Alper & Averbuch-Elor, 2024)). Then, we search for the most inclusive caption of the retrieved caption in the database using the inclusion measure that we proposed, and we use this most inclusive caption as the [ROOT] embedding, which is a more plausible root compared to the average or null embedding. As shown in Figure B.1, the [ROOT] embedding of CLIP is always closest to "umbrella", which makes the image traversal by CLIP inaccurate. On the other hand, the [ROOT] embedding of Pro LIP correctly estimates the true hierarchy of the given image query.

B.5 HIERARCHICAL IMAGES DATASET CONSTRUCTION

We construct the hierarchical image dataset by using the validation set of the COCO dataset (Lin et al., 2014), which includes both images and their corresponding segmentation maps. First, the images were filtered to retain only those where the smallest segment was larger than 5000 pixels. Then, the three largest segments were extracted from each image and pasted onto a blank canvas in descending order of size, as sketched below. We illustrate the generated samples in Figure B.2.
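A minimal sketch of this construction is shown next, under the assumption that each image and its segmentation masks are already loaded as NumPy arrays and that each level pastes one more segment onto the canvas (our reading of the description above); the loading step, e.g., via pycocotools, is omitted, and all names are illustrative.

```python
# Minimal sketch: build hierarchical images by cumulatively pasting the largest segments.
import numpy as np

def build_hierarchy(image, masks, min_area=5000, n_levels=3):
    """image: (H, W, 3) array; masks: list of (H, W) boolean segmentation masks."""
    areas = [int(m.sum()) for m in masks]
    if not areas or min(areas) <= min_area:      # keep images whose smallest segment > 5000 px
        return None
    order = np.argsort(areas)[::-1][:n_levels]   # three largest segments, descending by size
    canvas = np.zeros_like(image)                # blank canvas
    levels = []
    for idx in order:
        canvas = canvas.copy()
        canvas[masks[idx]] = image[masks[idx]]   # paste one more segment at each level
        levels.append(canvas)
    return levels                                # level 0 (one segment) ... level 2 (three segments)
```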
C MORE RESULTS

C.1 UNCERTAIN AND LONG CAPTIONS

The following captions show very high uncertainty values despite their length (e.g., top 20% uncertain samples in Figure 4). Note that, as in Figure 6, a caption with a longer length can still have a high uncertainty when the caption carries almost useless information. Below, UNC denotes the uncertainty value of the corresponding caption, namely tr(Σt).

UNC: 0.0415 Searching for Meaning: Idealism, Bright Minds, Disillusionment, and Hope (Third in a Series of See Jane Win(tm) Books) Cover Image
UNC: 0.0415 Graphic design profession workdesk monitor printer books lamp pc computer stock illustration
UNC: 0.0416 Photo ID: 2299331 Views: 14221 UK - Air Force Eurofighter EF-2000 Typhoon FGR4 (ZK306) shot at Fairford (FFD / EGVA) UK - England July 21, 2013 By Michael Brazier-Aviation-Images
UNC: 0.0416 Vigo 36 inch Farmhouse Apron Single Bowl 16 Gauge Stainless Steel Kitchen Sink with Zurich Chrome Faucet, Grid, Strainer and Soap Dispenser
UNC: 0.0418 XIAOMI i Health PT-101B Medical Baby High Sensitivity LED Electric Thermometer Underarm/Oral Soft Head Thermometer Adult Baby Tthermometer Sensor
UNC: 0.0462 Maidofhonortoastus Scenic Sample Invoices Created With Our Online Invoicing Software With Licious Sample Invoice Template With Comely Free Excel Invoice Also Invoice Template For Self Employed In Addition Used Vehicle Invoice And Invoice Payment Reminder As Well As Print Invoice Template Additionally It Services Invoice Template From Invoiceberrycom With Maidofhonortoastus Licious Sample Invoices Created With Our Online Invoicing Software With Comely Sample Invoice Template And Scenic Free Excel Invoice Also Invoice Template For Self Employed In Addition Used Vehicle Invoice From Invoiceberrycom

Figure B.1: Image traversal comparison between Pro LIP and CLIP on Hierar Caps (Alper & Averbuch-Elor, 2024); the highlighted captions denote the wrong captions.
UNC: 0.0462 Carterusaus Ravishing Invoice Freewordtemplatesnet With Inspiring Proforma Invoice With Easy On The Eye Chicken Receipts Also I Receipt Notice In Addition Cash Receipts Template And Read Receipt For Gmail As Well As Annual Gross Receipts Additionally Platepass Receipt From Freewordtemplatesnet With Carterusaus Inspiring Invoice Freewordtemplatesnet With Easy On The Eye Proforma Invoice And Ravishing Chicken Receipts Also I Receipt Notice In Addition Cash Receipts Template From Freewordtemplatesnet
UNC: 0.0466 Aldiablosus Ravishing Addition Worksheets Dynamically Created Addition Worksheets With Entrancing Addition Worksheets With Appealing Phonics Worksheets Free Also Functions Worksheet Algebra In Addition Cellular Respiration Worksheet Middle School And Beginning Addition Worksheets As Well As Real World Math Worksheets Additionally Chemical Change Worksheet From Mathaidscom With Aldiablosus Entrancing Addition Worksheets Dynamically Created Addition Worksheets With Appealing Addition Worksheets And Ravishing Phonics Worksheets Free Also Functions Worksheet Algebra In Addition Cellular Respiration Worksheet Middle School From Mathaidscom

C.2 MORE ZERO-SHOT CLASSIFICATION EXPERIMENTS

We show the full results of the 38 tasks in Table C.1. We can observe that, in most benchmarks, Pro LIP outperforms CLIP and Sig LIP. We also fine-tune the pre-trained Sig LIP or CLIP models with our probabilistic training strategy. As we clarified in Appendix B.2, these models were slightly modified due to the architectural differences between the original pre-trained models and our main Pro LIP model (e.g., attention pooling vs. [CLS] token pooling, and image resolution). Table C.2 shows that the overall performance is improved by increasing the parameter size (from 62.1 to 66.9).

C.3 ABLATION STUDY

In this subsection, we conduct an ablation study of our design choices. We train the models on the Conceptual Caption 3M (Sharma et al., 2018), Conceptual Caption 12M (Changpinyo et al., 2021), and Red Caps (Desai et al., 2021) datasets with 96M seen samples. This efficient setting enables training a model in 7 hours with 8 A100 NVIDIA GPUs. We measure the effectiveness of our design choices in three categories. First, we measure Image Net zero-shot top-1 accuracy (IN-Top1). Second, we measure the average σ2 values of the visual and textual modalities. If a model captures the inherent uncertainty well, we assume that it will have a higher uncertainty for captions than for images. Finally, we measure the Hierar Caps recall, to measure whether the learned uncertainty captures the hierarchy of captions well.

Figure B.2: Hierarchical COCO samples.

Table C.3 shows that (1) the PCME++ loss is not trainable when we use a more limited architecture than the multi-head architecture for estimating uncertainty, which supports our assumption discussed in Appendix A.3; (2) the PCME++ loss becomes learnable if we use a deterministic loss together; (3) Linc(v → t) enforces images to be included in texts, namely, image uncertainty tends to be smaller than text uncertainty with the inclusion loss; and (4) Linc(x → x_masked) enables the embedding space to capture the hierarchy of the data, namely, it improves the Hierar Caps recall. Although Table C.3 gives a good intuition for each loss function, we remark that the learned uncertainty might need more training time to become more accurate. From this observation, we also compare Pro LIP models by ablating the main losses on Data Comp 1B with 1.28B seen samples.
Table C.4 shows that using all proposed objectives achieves the best Image Net and VTAB results. Although it shows slightly worse performance in retrieval, we found that its performance is reasonably good compared to the other variants, especially the baseline (first row), in all measurements. Table C.5 shows the impact of the choice of the stability parameters for the inclusion loss. As we mentioned in Appendix B.1, the final performance is relatively robust to the choice of hyperparameters. In practice, we recommend ε = -100 and c = 10 for both stable training and performance.

We also compare the efficiency of the proposed [UNC] token architecture against the deterministic baseline and the multi-head uncertainty estimation module of PCME++. We compare them with five different architectures (Vi T-B/32, Vi T-B/16, Vi T-B/16 with 768-width, Vi T-L/16, and Vi T-L/14). Here, we only compare image encoders because image encoders always take the same image token length, which makes it easier to compare the impact of different design choices. We report (1) the input image token sizes for each architecture (the [UNC] token architecture uses one more), (2) the base parameters of the deterministic one, (3) the inference time for the 50k Image Net validation images (lower is faster), and (4) the additional parameters compared to the deterministic baseline.

Table C.6 shows three findings. First, even if we increase the input token length, the actual inference speed is not quadratically slower: Vi T-B/32 and Vi T-B/16 have the same Transformer capacity but different input lengths (49 vs. 196), yet their inference times are 75.18s and 76.68s, which is an almost negligible change.

Table C.1: Zero-shot classification full results. FT denotes the fine-tuned results of the pre-trained models with deterministic objectives. Details of fine-tuning can be found in Appendix B.2.
Caltech-101 CLEVR Counts CLEVR Distance Describable Textures FGVC Aircraft Image Net 1k Image Net Sketch Image Net v2 Image Net-A Image Net-O Image Net-R KITTI Vehicle Distance
1.28B seen samples
CLIP 91.75 94.34 77.85 18.01 15.79 15.63 60.43 45.09 20.77 86.55 50.27 67.24 55.70 58.95 30.73 53.70 76.35 30.80 72.96 55.39
Sig LIP 92.77 94.17 78.25 20.11 15.87 15.18 60.90 39.15 21.01 85.96 40.71 67.37 55.93 59.67 30.80 54.25 76.41 18.00 73.17 55.60
Pro LIP 92.43 93.97 77.48 22.38 15.51 15.29 61.28 40.17 21.78 85.29 51.32 67.76 56.34 59.91 30.92 53.65 76.24 39.10 74.78 54.94
Pro LIP (Vi T-B FT) 93.69 96.39 83.13 21.03 15.17 18.35 65.32 57.28 40.31 90.55 47.39 74.60 64.78 66.03 46.81 43.55 86.63 27.29 84.33 65.44
Pro LIP (Vi T-L FT) 94.58 98.28 87.79 23.31 16.27 26.56 70.53 71.26 49.91 94.20 52.76 79.40 70.28 72.77 67.67 33.45 92.26 19.97 85.05 75.37
Pro LIP (SO400M FT) 94.29 98.17 87.92 23.02 16.69 29.75 70.27 64.96 46.11 94.27 57.22 79.26 69.74 72.55 70.05 33.15 92.32 32.91 74.22 76.00
Pro LIP (Vi T-H FT) 95.05 98.29 87.54 23.11 22.90 30.64 70.64 59.80 49.85 94.64 61.91 79.41 69.61 72.83 67.97 33.00 92.07 16.03 87.35 74.43
Pro LIP (12.8B) 93.61 96.42 83.25 29.78 15.13 21.29 66.91 60.89 38.02 91.03 52.79 74.56 63.65 66.66 50.25 45.40 86.00 32.21 84.47 65.80
Oxford Flowers-102 Oxford-IIIT Pet Pascal VOC 2007 Patch Camelyon Rendered SST2 Stanford Cars Wino GAVi L Dollar Street
CLIP 69.67 88.57 81.38 56.67 50.08 58.57 83.09 96.78 66.43 59.68 11.16 56.03 10.57 71.09 45.49 43.46 56.43 86.97 57.12
Sig LIP 68.62 88.31 81.46 56.66 49.20 55.62 83.82 96.23 67.60 61.70 10.07 63.84 11.69 71.76 45.47 43.09 57.48 87.56 56.72
Pro LIP 71.16 89.05 80.03 65.76 49.75 57.75 85.66 96.41 65.67 62.60 10.13 64.34 8.99 71.13 45.73 42.12 56.78 86.97 57.91
Pro LIP (Vi T-B FT) 79.76 93.49 79.37 55.87 55.68 65.30 91.21 98.26 70.26 68.15 15.28 63.47 11.15 79.00 51.71 44.32 61.80 88.81 62.13
Pro LIP (Vi T-L FT) 83.24 95.27 81.13 58.07 57.66 71.59 93.72 99.02 73.97 66.53 16.23 66.76 18.24 82.26 55.76 45.86 67.06 91.15 65.93
Pro LIP (SO400M FT) 84.18 95.55 82.58 58.54 66.78 74.78 93.20 98.84 74.25 68.60 16.66 67.29 23.18 83.46 56.61 47.35 66.36 91.02 66.63
Pro LIP (Vi T-H FT) 82.78 95.64 82.82 64.25 59.80 72.94 94.62 99.12 75.08 71.47 13.40 79.91 20.56 81.55 55.82 47.38 66.24 91.73 66.90
Pro LIP (12.8B) 78.37 93.45 81.74 61.41 54.31 68.27 91.33 97.91 71.35 72.77 12.64 57.85 15.12 79.97 53.18 45.56 62.27 90.31 63.31

Table C.2: Comparison of fine-tuned Pro LIP with 1.28B seen samples. Vi T-H/14 is based on CLIP by DFN (Fang et al., 2024), while the other models are based on Sig LIP (Zhai et al., 2023). FLOps are for the base pre-trained models, not the architecture modified by Pro LIP.
Backbone FLOps # Samples Seen Image Net IN dist. shifts VTAB Retrieval Average
Vi T-B/16 44.44G 1.28B* 74.6 62.2 61.2 58.3 62.1
Vi T-L/16 136.41G 1.28B* 79.4 68.6 64.0 61.3 65.9
Vi T-SO400M/14 233.54G 1.28B* 79.3 69.0 65.1 62.5 66.6
Vi T-H/14 381.68G 1.28B* 79.4 68.3 64.4 61.6 66.9

If we choose a larger model (e.g., Vi T-L), the difference becomes larger, e.g., 137.77s vs. 176.95s, but it is still not of quadratic order. Second, the [UNC] token adds an almost negligible number of parameters (0.3M for B and 0.6M for L) and negligible inference time compared to the deterministic one. Finally, the multi-head architecture needs a large number of additional parameters (e.g., 20M for B and 40M for L) and shows slower inference time, especially for a larger network (e.g., 176.95s vs. 191.68s for Vi T-L/14).
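To make the architectural comparison concrete, the following PyTorch sketch shows why the [UNC] token design adds only a handful of parameters: it appends a single learnable token to the encoder input and reads its final hidden state out as the uncertainty, whereas a multi-head design would attach a separate uncertainty module on top of the encoder. The encoder, projection, and readout below are illustrative assumptions, not the exact Pro LIP modules.

```python
# Minimal sketch of a [UNC]-token wrapper around a generic Transformer encoder.
import torch
import torch.nn as nn

class UncTokenWrapper(nn.Module):
    def __init__(self, encoder, width=768, embed_dim=512):
        super().__init__()
        self.encoder = encoder                                    # any (B, L, width) Transformer
        self.unc_token = nn.Parameter(torch.zeros(1, 1, width))   # the only extra input token
        self.proj = nn.Linear(width, embed_dim)                   # shared projection (assumption)

    def forward(self, tokens):                                    # tokens: (B, L, width), [CLS] first
        unc = self.unc_token.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([tokens, unc], dim=1))       # one extra token appended
        mean_state, unc_state = out[:, 0], out[:, -1]
        return self.proj(mean_state), self.proj(unc_state)        # (mu, uncertainty readout)

# Illustrative usage with a toy encoder (not the ProLIP backbone).
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = UncTokenWrapper(nn.TransformerEncoder(layer, num_layers=2))
mu, unc = model(torch.randn(2, 197, 768))                         # 196 patch tokens + [CLS]
```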
Furthermore, in practice, the multi-head architecture requires more memory than the [UNC] token, which makes it difficult to use a large batch size and to scale up to a larger backbone. On the other hand, the [UNC] token only needs an almost negligible amount of additional parameters, inference time, and memory, which makes it easier to scale up.

C.4 MORE VISUAL EXAMPLES

We visualize more samples with combinations of various image and text uncertainties. Figure C.1 shows example image-text pairs with their uncertainty values and the similarity score measured by CSD (Equation (1)). The results are similar to Figure 5.

C.5 MORE RESULTS FOR HIERARIMGS

Unlike text uncertainty (Figure 7), we observe that the visual uncertainty values are not discriminative by level, in contrast to text uncertainty (see Figure C.2a). Lower-level images tend to have a larger average uncertainty (0.015) than the original images (0.014), but their differences between levels are not as significant as for texts. Instead of plotting every image in the same histogram, we plot the difference of uncertainty between the original image and its masked versions in Figure C.2 (b-d). To understand why some images have reversed image uncertainty by their hierarchy, we visualize the images whose original image is not included by their level 0 images. Interestingly, as shown in Figure C.3, we can observe that many images with an improper uncertainty estimate by hierarchy actually have a wrong visual semantic hierarchy with severely occluded main objects. For example, in the upper-row image, the dog appears at the second level, but it only reveals a part of its body rather than the whole body. These results show that proper filtering of the Hierar Imgs dataset would be required for a more reliable evaluation.

Table C.3: Ablation study. All models are trained on Conceptual Caption 3M (Sharma et al., 2018), 12M (Changpinyo et al., 2021), and Red Caps (Desai et al., 2021), where the number of seen samples is 96M.
Loss Unc Arch Linc(v → t) Linc(x → x_masked) IN-Top1 avg σ2v avg σ2t H.Cap Recall
CLIP - - - 35.5 - - 48.4
PCME++ + CLIP multi-head 33.6 0.0160 0.0077 55.2
PCME++ [UNC] 0.1 0.0354 0.0347 -
PCME++ + CLIP [UNC] 36.1 0.2552 0.2118 53.8
Pro LIP [UNC] 37.4 0.3276 0.0745 44.8
Pro LIP [UNC] 36.8 0.0076 0.2324 46.7
Pro LIP [UNC] 37.5 0.3319 0.0610 47.9
Pro LIP [UNC] 37.0 0.0086 0.2254 54.8

Table C.4: Large-scale ablation. All models are Vi T-B/16 trained on Data Comp 1B with 1.28B seen samples.
Linc(v → t) Linc(x → x_masked) Image Net IN dist. shifts VTAB Retrieval Average
67.0 54.6 56.2 53.6 56.6
67.3 54.6 56.4 53.2 57.0
67.4 54.4 56.4 53.2 56.7
67.6 55.0 57.1 53.4 57.3

Table C.5: Impact of ε and c for the inclusion loss. Details are the same as Table C.4.
ε c Image Net IN dist. shifts VTAB Retrieval Average
-20 1000 67.6 55.0 57.1 53.4 57.3
-10 1000 67.4 55.1 57.3 53.1 57.1
-5 1000 67.7 55.2 55.9 52.9 56.6
-5 100 67.8 55.3 56.7 53.0 57.5
-5 10 68.0 55.5 56.8 53.6 57.4
-10 10 67.8 55.3 58.5 53.0 57.9
-100 10 67.7 55.5 56.9 53.7 57.5

C.6 MORE DISCUSSIONS FOR HUMAN PREFERENCE AND LEARNED UNCERTAINTY

Figure C.4: The uncertainty value comparison between different levels ("Does level-{source} include level-{target}?", over source and target caption levels 0-3).

In Figure 7, we show that the learned uncertainty by Pro LIP is well separable by the hierarchy of Hierar Caps, which is validated by humans.
Namely, a Hierar Caps caption quadruple from level 0 (the most abstract one) to level 3 (the most detailed one) has an inclusion relationship verified by humans. From this observation, we can conduct a virtual human study to determine whether the learned uncertainty correctly captures human preference. First, we measure how consistently the uncertainty values in each Hierar Caps caption quadruple are ordered decreasingly. Namely, we compute the following metric:
$$\frac{1}{|T|} \sum_{t \in T} \mathbb{I}\left(\mathrm{tr}(\Sigma_t^i) > \mathrm{tr}(\Sigma_t^{i+1})\right), \tag{C.1}$$
where $\mathbb{I}$ denotes the indicator function and $\Sigma_t^i$ denotes the uncertainty value of level $i$ of caption $t$. Using the Vi T-B/16 Pro LIP model with 12.8B seen samples, 90.0% of adjacent uncertainties satisfy the decreasing order. Second, similar to Figure C.2, we show the uncertainty value comparison between different levels in Figure C.4. In the figure, we can observe that most of the level 3 (all) and level 2 (0.997) captions have smaller uncertainty values than their corresponding level 0 and level 1 captions. Also, a large fraction of level 3 captions (0.923) have smaller uncertainty values than their level 2 captions. We found that the level 0 and level 1 captions have relatively similar uncertainty values (but still, 78.2% of level 0 captions have larger uncertainty than their level 1 captions); as we observed in Figures 9 and B.1, the difference between level 0 and level 1 captions is often vague (e.g., "water sports" vs. "kite surfing"). Overall, we argue that Pro LIP captures human uncertainty preference well, as supported by Figure C.4.

Figure C.1: Samples with high/low image/text uncertainty. Samples are drawn from the Data Comp small subset. We also report CSD (similarity, lower is closer) between the pair, σ2v, and σ2t (lower is more certain).

Figure C.2: Hierar Imgs σ2v statistics. (a) All image uncertainties by hierarchy; (b) σ2v difference between level 0 and the original images (70.6% of samples become more uncertain); (c) σ2v difference between level 1 and the original images (75.3%); (d) σ2v difference between level 2 and the original images (76.6%).

Figure C.3: Examples of Hierar Imgs where Pro LIP fails to capture σ2v by hierarchy.

Table C.6: Uncertainty architecture comparisons. The total inference speeds (in seconds) of each architecture for the 50k Image Net validation images are shown (lower is better). The numbers in parentheses denote the number of additional parameters compared to the deterministic baseline model.
Vi T-B/32 Vi T-B/16 Vi T-B/16-768 Vi T-L/16 Vi T-L/14
# img tokens 49 196 196 196 256
Base param 151.7M 150.0M 197.3M 428.5M 428.4M
Multi-head 75.75s (+20.7M) 78.18s (+20.7M) 80.08s (+28.9M) 146.12s (+40.0M) 191.68s (+40.0M)
[UNC] (proposed) 75.24s (+0.3M) 76.84s (+0.3M) 78.61s (+0.6M) 137.99s (+0.6M) 177.82s (+0.6M)
Deterministic 75.18s 76.68s 78.53s 137.77s 176.95s

C.7 MORE DISCUSSIONS FOR BAYESIAN PROMPT RE-WEIGHTING (BPRW)

Hyperparameters. When ground-truth labels are not accessible (i.e., K = 0 in Table 3), we set α to 5. Then, we select the 5 nearest images from each class embedding made by the 80 prompts.
We then sample 10 points from the selected image embeddings and obtain 100 point image embeddings. The sampled embeddings are then used as the observations of the algorithm (see Appendix A.8). We use ε = 0.02 for a stable convergence. For few-shot settings with K > 0, we set α to 2 and select 100/K samples per image embedding (e.g., if K = 9, then we sample 11 point vectors from each image embedding).

Visualization of the learned weight. We show examples of π and their corresponding images in Figure C.5. We select three classes as examples and show their images and the learned π. Interestingly, for black-footed ferret images, we found that the context "my" has more than 0.5 weight. The actual images of black-footed ferrets are mostly pet images, so it makes sense that the "my" prompt matches the images the most. Similarly, we observe that front curtain images are mostly low resolution due to the insufficient light in theaters, resulting in "a low resolution" or "a dark photo" prompts becoming the most important. Lastly, we see that the missile images are mostly taken in museums, resulting in the "a close-up" prompt becoming the most important, but not as significantly (0.1162) as the most contributing prompts of the black-footed ferret (0.5130) and front curtain (0.3921) images.

Using the π obtained from a few-shot setting (K = 5), we visualize the learned prompt weights. Figure C.6a shows the histogram of the maximum πc for each class c; a uniform distribution would have 0.01, while a larger max πc denotes that specific prompts are selected for the class. In the figure, we can observe that π is generally larger than uniform. In addition, we plot the entropy of π in Figure C.6b, which shows a similar result.

C.8 MORE APPLICATIONS OF THE LEARNED UNCERTAINTY

Dataset filtering. Below (Table C.7), we show results on the Data Comp CLIP filtering small track (filtering 12.8M noisy web-crawled image-text pairs) using our method and the baselines provided by Data Comp: Pro LIP uncertainty-aware features help better filtering compared to the other baselines. However, we note that Pro LIP is not specifically designed for dataset filtering; proposing a new dataset filtering method using Pro LIP would be interesting future work, but it is not in the scope of the current paper.

Understanding an image dataset. Pro LIP's image uncertainty is not the same as the image uncertainty of classification tasks. In classification tasks, an image has a high uncertainty if it can be matched to multiple classes, while Pro LIP assigns a high uncertainty if an image can be described by multiple different captions (Appendix A.1). For example, assume an image with a white background and a clear overall object shape. For classification, it has low uncertainty because there is no confounder to the classification. However, Pro LIP will assign a high uncertainty to it. From this, we can think of two different scenarios: (1) when all images have a homogeneous background and only the quality of the image determines the classification performance (e.g., MNIST), and (2) when images are natural images and the task is inherently multi-object classification, but the labels are single-labeled (e.g., Image Net, as discussed by Beyer et al. (2020) and Yun et al. (2021)).
Figure C.5: Visualization of the learned π by BPRW for each class (black-footed ferret: "a photo of my black-footed ferret." 0.5130, "itap of my black-footed ferret." 0.0250, ..., "a origami black-footed ferret" 0.0056, "a sculpture of a black-footed ferret" 0.0056; front curtain: "a low resolution photo of the front curtain." 0.3921, "a dark photo of a front curtain." 0.1102, ..., "a drawing of a front curtain" 0.0056, "a cartoon front curtain" 0.0056; missile: "a close-up photo of a missile." 0.1162, "a black and white photo of the missile." 0.0842, ..., "the plushie missile" 0.0056, "a doodle of the missile" 0.0056).

Figure C.6: Statistics of the learned π by BPRW. We use the π obtained from the few-shot setting with K = 5. (a) Histogram of max πc; the uniform distribution would have 0.001 max π. (b) Histogram of the entropy of πc; the uniform distribution would have an entropy value of 6.9078.

Table C.7: Dataset filtering. Results on the Data Comp small track (Gadre et al., 2024).
Size Image Net IN dist. VTAB Retrieval Average
No filtering 12.8M 0.025 0.033 0.145 0.114 0.132
Random subset (25%) 3.2M 0.022 0.032 0.130 0.099 0.126
LAION-2B filtering 1.3M 0.031 0.040 0.136 0.092 0.133
English (fasttext), cap length, and img size 3M 0.038 0.043 0.150 0.118 0.142
Image-based & CLIP score (L/14 30%) 1.4M 0.039 0.045 0.162 0.094 0.144
CLIP L14 (20%) 2.6M 0.042 0.051 0.165 0.100 0.151
Pro LIP distance (20%) 2.3M 0.042 0.047 0.167 0.117 0.154

As the first example, we choose MNIST. We observe that the learned image uncertainty and the MNIST accuracy show a strong negative correlation (-0.98); namely, if an image is more uncertain, then Pro LIP tends to estimate a wrong label. As the second example, we choose Image Net-1k, which shows a strong positive correlation (+0.98), i.e., a certain image tends to be wrongly classified by Pro LIP. This could be counterintuitive "in classification", but it is a correctly estimated value. For example, Image Net contains various image distributions. Some images are thumbnail images with a white background (high uncertainty), and some images are in-the-wild images with complex backgrounds and objects (low uncertainty). In this case, a more certain image (a more complex image) will be a more difficult image to classify, which supports the positive correlation. Overall, Pro LIP's image uncertainty tendency can be used to understand an image dataset. Converting Pro LIP's image uncertainty to image classification uncertainty would be an interesting topic.

Uncertainty by image manipulation. We additionally show the relationship between image manipulation and uncertainty. We evaluate Image Net-1k zero-shot accuracy by applying a center occlusion with a 0% to 10% occlusion ratio (Table C.8). We also test optimized noise by the PGD attack (Madry et al., 2018) with sampled Image Net images (Table C.9).

Table C.8: Occlusion vs. image uncertainty. Numbers are measured on the Image Net validation set.
Occlusion ratio 0% 2.5% 5% 7.5% 10%
Image Net-1k zero-shot 74.6 74.1 73.8 73.5 73.2
avg(σv) 0.0148 0.0149 0.0152 0.0153 0.0153

Here, we observe that the image uncertainty is increased by more severe manipulation. Note that, as we discussed in "Understanding an image dataset", converting Pro LIP's image uncertainty to image qualification would be an interesting topic, but we leave this for future work.

Table C.9: PGD vs. image uncertainty. PGDk denotes the PGD attack with k iterations.
Clean PGD1 PGD5 PGD10 PGD40
Image Net-1k zero-shot (1000 images) 72.9 20.0 3.8 2.6 2.5
avg(σv) 0.0147 0.0149 0.0167 0.0175 0.0190

Pro LIP with long context text. Although Pro LIP can capture the inherent ambiguity in vision-language tasks, Pro LIP is limited in capturing long context text exceeding 64 tokens. To tackle this issue, we fine-tune the pre-trained Pro LIP with long context texts, following Long CLIP (Zhang et al., 2024). In our preliminary study, we observe that fine-tuning solely with the long text dataset can lead to a significant performance drop on general zero-shot tasks. More details of our extension, Long Pro LIP (expanding the text context length from 64 to 256), are out of the scope of this paper; we refer interested readers to the Long Pro LIP technical report (Chun & Yun, 2025).

D DISCUSSION AND LIMITATION

As Pro LIP is based on a normal distribution with diagonal covariance, Pro LIP also shares two concerns discussed by Chun (2024): (1) a diagonal normal distribution can be insufficiently expressive compared to the full covariance, and (2) if we use different probability distributions (e.g., a von Mises-Fisher distribution or a Laplacian distribution), the closed forms for PPCL and the inclusion loss no longer apply. For the first concern, as already discussed by Chun (2024), the diagonal covariance would be insufficient if the dimensionality is too small (e.g., less than ten). In this case, using the full covariance or a mixture of Gaussians (MoG) will improve the representation power of the uncertainty. However, in practice, we use a very high dimensionality, e.g., 768 for Vi T-B/16, that can sufficiently capture complex semantics. One can also argue that using a MoG is more sensible for capturing many-to-many correspondences. However, if we have a sufficiently large dimensionality, a MoG will not be more effective. Consider a probabilistic embedding with a 2-mode MoG, namely $Z \sim \mathcal{N}(\mu_1, \Sigma_1)$ with probability 0.5 and $Z \sim \mathcal{N}(\mu_2, \Sigma_2)$ with probability 0.5. We can compute the expected CSD between two such embeddings (where $Z_1$ is parameterized by $\mu^1_1, \mu^1_2, \Sigma^1_1, \Sigma^1_2$ and $Z_2$ is parameterized by $\mu^2_1, \mu^2_2, \Sigma^2_1, \Sigma^2_2$) by computing
$$\frac{1}{4}\left[ d(\mu^1_1, \mu^2_1, \Sigma^1_1, \Sigma^2_1) + d(\mu^1_1, \mu^2_2, \Sigma^1_1, \Sigma^2_2) + d(\mu^1_2, \mu^2_1, \Sigma^1_2, \Sigma^2_1) + d(\mu^1_2, \mu^2_2, \Sigma^1_2, \Sigma^2_2) \right], \tag{D.1}$$
where $d(\cdot)$ is CSD, i.e., $d(\mu_1, \mu_2, \Sigma_1, \Sigma_2) = \|\mu_1 - \mu_2\|_2^2 + \mathrm{tr}(\Sigma_1 + \Sigma_2)$. For simplicity, we omit the factor $\frac{1}{4}$ in the remaining derivation. Now, consider two virtual uni-modal Gaussian embeddings $W_1 \sim \mathcal{N}(\mu^1_1 \oplus \mu^1_1 \oplus \mu^1_2 \oplus \mu^1_2,\; \Sigma^1_1 \oplus \Sigma^1_1 \oplus \Sigma^1_2 \oplus \Sigma^1_2)$ and $W_2 \sim \mathcal{N}(\mu^2_1 \oplus \mu^2_2 \oplus \mu^2_1 \oplus \mu^2_2,\; \Sigma^2_1 \oplus \Sigma^2_2 \oplus \Sigma^2_1 \oplus \Sigma^2_2)$, where $\oplus$ denotes the concatenation operation, i.e., $W_1$ and $W_2$ have four times larger dimensionality than $Z_1$ and $Z_2$. Interestingly, we can easily show that Equation (D.1) (without the omitted $\frac{1}{4}$ factor) equals the CSD between $W_1$ and $W_2$. Note that this derivation does not rely on the diagonal covariance assumption; it also holds for the full covariance. In other words, using a MoG is mathematically equivalent to using a larger dimensionality (as much as the square of the number of modes); therefore, if we have a sufficiently large dimensionality that can capture the ambiguity of the dataset, a MoG is not a mandatory option. The second concern can be raised when we use a different probability distribution. As discussed by Chun (2024), all objective functions and computations are distribution-free, but the derived closed-form solutions (e.g., CSD, the inclusion loss) will no longer hold if we use different distributions. One exception is a MoG with equal mixing coefficients, but it equals simply using a larger dimensionality.
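The equivalence above is easy to verify numerically. The following sketch (illustrative NumPy with random placeholder parameters) checks that the four-term sum in Equation (D.1), i.e., without the 1/4 factor as in the text, matches the CSD between the concatenated embeddings W1 and W2.

```python
# Numerical check: expected CSD of 2-mode MoGs equals CSD of concatenated Gaussians.
import numpy as np

rng = np.random.default_rng(0)
D = 8
mu = {(z, m): rng.normal(size=D) for z in (1, 2) for m in (1, 2)}            # mu^z_m
var = {(z, m): rng.uniform(0.01, 0.1, size=D) for z in (1, 2) for m in (1, 2)}  # diag Sigma^z_m

def csd(m1, v1, m2, v2):
    return np.sum((m1 - m2) ** 2) + np.sum(v1) + np.sum(v2)

# Sum of the four cross-mode CSD terms (Eq. (D.1) without the 1/4 factor).
four_terms = sum(csd(mu[1, a], var[1, a], mu[2, b], var[2, b])
                 for a in (1, 2) for b in (1, 2))

# Concatenated uni-modal Gaussians W1 and W2 as defined in the text.
w1_mu = np.concatenate([mu[1, 1], mu[1, 1], mu[1, 2], mu[1, 2]])
w1_var = np.concatenate([var[1, 1], var[1, 1], var[1, 2], var[1, 2]])
w2_mu = np.concatenate([mu[2, 1], mu[2, 2], mu[2, 1], mu[2, 2]])
w2_var = np.concatenate([var[2, 1], var[2, 2], var[2, 1], var[2, 2]])

assert np.isclose(four_terms, csd(w1_mu, w1_var, w2_mu, w2_var))
```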
In practice, if we really need different distributions, we can use a Monte-Carlo approximation as in Chun et al. (2021), which is known to be inefficient and inaccurate (Chun, 2024).