# What Makes ImageNet Look Unlike LAION

Published in Transactions on Machine Learning Research (02/2025)

Ali Shirali (shirali_ali@berkeley.edu), University of California, Berkeley
Moritz Hardt, Max Planck Institute for Intelligent Systems, Tübingen and Tübingen AI Center

Reviewed on OpenReview: https://openreview.net/forum?id=IrBYuh9W3T

ImageNet was famously created by querying several image search engines such as Flickr. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference between two plausible causal data-generating processes for the respective datasets, which we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.

1 Introduction

For nearly a decade, ImageNet (Deng et al., 2009) was the focal benchmark for much of computer vision and deep learning. Created from image web search results and human filtering, ImageNet contributed curated images suitable for supervised learning at the time.
In recent years, however, the community has seen a new generation of models trained on massive amounts of noisy image-text data gathered from the web with minimal curation. Available to the academic public is the massive-scale LAION dataset, in two versions, featuring 400 million (Schuhmann et al., 2021) and 5 billion (Schuhmann et al., 2022) crawled image-text pairs, filtered by the OpenAI CLIP model (Radford et al., 2021) for sufficient image-text relevance rather than by human annotators.

At the outset, LAION works much like text-based web image search. We can specify a query and retrieve images with high similarity between the query and the text surrounding the image on the website from which it was crawled. We can therefore search LAION for each of the 1000 categories in the ImageNet ILSVRC-2012 dataset1 and retrieve images corresponding to each of the classes. This process is much like the first step of creating ImageNet from Flickr search results, except that LAION replaces Flickr; either way, both are based on web crawls. Where the creators of ImageNet hired human annotators to filter images, we analyze image captions to ensure that the resulting images have high fidelity to the class category.

We might expect that for a suitably chosen textual similarity threshold, the resulting dataset would bear resemblance to the original ImageNet. However, we demonstrate that this is anything but the case. The dataset, so created from LAION, very much looks unlike ImageNet. And we explain why, supported by independent evidence from other well-curated datasets. This explanation, although subtle, reveals a fundamental fact about the difference between ImageNet and LAION that has consequences for understanding dataset creation at large.

1 Unless otherwise stated, by ImageNet we mean the ImageNet ILSVRC-2012 dataset.
1.1 Our Contributions

We introduce a new research artifact, called LAIONet, that aims at a recreation of ImageNet on the basis of LAION. We start from LAION-400M, a collection of 400M image-text pairs extracted from web pages in Common Crawl (commoncrawl.org) between 2014 and 2021. The relevance of images and their corresponding texts was quality-controlled with the OpenAI CLIP model, excluding instances with a cosine similarity of image and text embeddings less than 0.3.

Creation of LAIONet. We create LAIONet solely on the basis of text-based selection. We require the exact lemmas (terms) in a so-called synset of an ImageNet category to appear in the text corresponding to an image. Moreover, we require a high similarity between the text and the synset name and definition. We use the cosine similarity of CLIP text embeddings to calculate this similarity; we make consistent observations using MPNet (Song et al., 2020) as the text encoder. LAIONet's selection criteria are conservative in that they tend toward images that are easy to classify; at least from the CLIP point of view, there is no evidence that LAIONet images are harder to classify than ImageNet's.

Contrasting LAIONet and ImageNet. To begin to understand the differences between LAIONet and ImageNet, we evaluate a slew of ImageNet models on LAIONet. As we show, the accuracy of models trained on ImageNet drops by 5 to 12 percentage points when evaluated on LAIONet (Figure 1). In calculating accuracy, we weight classes uniformly, as is done in ImageNet. When classes are instead weighted by their frequency in LAIONet, accuracy drops by another 5 to 10 percentage points.
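The two accuracy notions used throughout can be made precise with a small sketch (the helper names are ours, not the paper's released code): equi-weighted accuracy is the unweighted mean of per-class recalls, while the frequency-weighted variant weights each recall by the class's relative frequency in the dataset.

```python
import numpy as np

def per_class_recall(y_true, y_pred, classes):
    # recall of class c = fraction of class-c examples predicted as c
    return {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}

def equi_weighted_accuracy(y_true, y_pred, classes):
    # uniform average of per-class recalls (the ImageNet-style metric)
    recalls = per_class_recall(y_true, y_pred, classes)
    return float(np.mean(list(recalls.values())))

def frequency_weighted_accuracy(y_true, y_pred, classes):
    # weight each class's recall by its relative frequency in the data
    recalls = per_class_recall(y_true, y_pred, classes)
    return float(sum(np.mean(y_true == c) * recalls[c] for c in classes))
```

The two metrics diverge whenever per-class recalls differ, which is exactly what drives the extra 5 to 10 point drop under LAION weighting.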
Figure 1: Accuracy of ImageNet-trained models when evaluated on the ImageNet validation set versus LAIONet (equi-weighted classes), in top-1 and top-5 accuracy. Three types of models are distinguished based on whether they are pre-trained on ImageNet-22k and whether they are fine-tuned on ImageNet-1k. Accuracy is defined as the average of the recalls calculated for each class that is present in LAIONet.

Drops in accuracy, such as these, are a well-documented phenomenon in machine learning at this point. In this work, we go a step further by providing a substantive explanation for the difference between LAIONet and ImageNet.

Diagnosing the Difference. In the first step, we observe that the intra-class similarity, measured as the pairwise similarity of the images within a class, is lower for LAIONet than for ImageNet. In other words, LAIONet images are more diverse within each class. The recall of the models is also lower in the classes with lower intra-class similarity. Hence, lower intra-class similarity gives a concrete reason why the accuracy of ImageNet models drops on LAIONet. But why does LAIONet have lower intra-class similarity in the first place?

We answer this question in terms of two plausible causal graphs for the respective data-generating processes (Figure 2). Both graphs are based on the standard anti-causal representation of classification problems (Schölkopf et al., 2012), whereby for each category Y there is a mechanism to generate data (here, image X and text T) given Y. But the graphs differ in one important aspect.

Figure 2: The suggested underlying mechanism of data generation and selection in (a) LAIONet and (b) ImageNet.
Class Y, text description T, image X, selection S or S′. The dashed line between X and T means that such links can exist in the graph.

In the case of LAIONet (Figure 2a), selection is based on text alone. The causal graph has the important property that the distribution of the image is independent of the selection decision conditional on the text. In other words, the text serves as an information bottleneck between the selection mechanism and the image. Choosing an image reveals nothing more about the image than what can be learned from its textual representation. This powerful conditional independence property limits how much selection can bias the distribution of the image.

In contrast, in the case of ImageNet (Figure 2b), there is a link from the image to the selection decision. For example, this link exists when human annotators see the full image and decide to select or discard it. The existence of this link is what can strongly bias the distribution of the image conditional on selection. It is this selection bias that is visible in the higher intra-class similarity.

Our case hinges on the existence and strength of the image-to-selection link in the causal graph for ImageNet. We then go beyond LAIONet and provide three complementary arguments as evidence:

We can weaken the image-to-selection link by considering ImageNet instances of different selection frequencies. The selection frequency describes the rate at which Amazon MTurk workers selected a candidate image into the dataset within a target class. This allows us to modulate the strength of the image-to-selection link. Looking at three versions of ImageNet V2 (Recht et al., 2019), we find that for a lower selection frequency, the resulting images come closer to LAIONet.

We show that text alone cannot explain why an image was selected into ImageNet. The ImageNet-Captions dataset (Fang et al., 2022) has restored the captions for one-third of the original ImageNet images.
If the text were the only factor in determining the relevance to a synset, it should explain why the images in ImageNet-Captions were selected. Looking at the similarity between texts and their synsets, a majority of text-synset pairs exhibit high similarity, but the distribution has a heavy tail and there are instances with low similarity. For pairs with low similarity, there are often many other synsets more similar to the text. This makes these instances unlikely to have been selected solely based on their text.

We search LAION for the texts most similar to the texts from the ImageNet-Captions dataset. The resulting images show significantly higher variability (in other words, lower intra-class similarity) than ImageNet. This suggests that another mechanism must have been at play.

In conclusion, we argue that the image-to-selection mechanism was significantly at play in the creation of ImageNet. It is this mechanism that makes ImageNet look unlike LAION. This insight has direct prescriptive value for dataset creation efforts in general. When creating a dataset where diversity is desired, we should select candidates on the basis of an information bottleneck. A succinct text caption, for example, generally carries much less information than the entire image. Selecting on the basis of the text caption, therefore, retains much of the entropy present in the image distribution.

All code is available at: https://github.com/alishiraliGit/eval-on-laion

1.2 Related Work

Torralba & Efros (2011) introduced cross-dataset evaluation in computer vision and specifically illustrated the uniqueness of each common dataset at the time, including ImageNet. Recreating an ImageNet test set, called ImageNet V2, although with a different motivation, was the subject of the seminal paper by Recht, Roelofs, Schmidt, and Shankar (2019). Engstrom et al.
(2020) argue that there is a subtlety in thresholding empirical estimates of the true underlying selection frequency of an image in ImageNet V2. Our argument, however, does not rely on any specific threshold of the selection frequency. We only need to observe what happens as we vary it from small to large. In contrast to ImageNet V2, our goal is not to recreate ImageNet as closely as possible. Rather, it is the differences between ImageNet and LAION that are the focus of our investigation.

Many other works have modified ImageNet for a variety of reasons. Geirhos et al. (2019) created a stylized version of ImageNet to reduce the reliance of the trained model on texture. Xiao et al. (2021) disentangled the foreground and background of ImageNet images to show the tendency of models to rely on the background. Li et al. (2023b) proposed the ImageNet-W test set by inserting a transparent watermark into the images of the ImageNet validation set, revealing the reliance of models on watermarks.

ImageNet undergoes ongoing augmentation over time. For example, the ImageNet-Captions (Fang et al., 2022) project has restored the captions of about one-third of the original ImageNet images from Flickr. ImageNet-X (Idrissi et al., 2023) provides a set of human annotations pinpointing 16 failure types for ImageNet, such as pose, background, or lighting. The peculiarities of ImageNet have been the subject of multiple studies. For example, Huh et al. (2016) found that the large size and many classes, including very similar classes, do not affect the successful transfer performance of ImageNet-trained features.

On the side of LAION, researchers are keenly interested in understanding the strong zero-shot accuracy of contrastive language-image models trained on LAION (Vogel et al., 2022). Fang et al.
(2022) found that neither the large training set size, nor the language supervision, nor the contrastive loss function determines this robustness, and that a more diverse training distribution should be the main cause. Our work demystifies this distributional advantage by contrasting ImageNet and LAION. Nguyen et al. (2022) compared various large image-text datasets differing in their creation processes and found that the robustness induced by each varies widely across different aspects, suggesting further studies of the role of dataset design. Our work highlights an important mechanism at play in dataset design that can move a dataset further away from a natural distribution.

2 LAIONet: An ImageNet Out of LAION

Our starting point is to create an ImageNet-like dataset from LAION. This dataset is a research artifact intended to highlight the differences between LAION and ImageNet. Our goal is not to provide a new benchmark or a new training set. However, LAIONet might be of interest for obtaining diverse samples, or variants of LAIONet may be created to improve our understanding of benchmarks.

To start, recall that every ImageNet class corresponds to a WordNet (Miller, 1998) synset, which consists of so-called lemmas. Synsets also come with a short definition known as a gloss. We label a LAION instance with a WordNet synset if 1) at least one lemma from the synset exists in the text of the instance, and 2) this text is sufficiently similar to the name and definition of the synset. Out of LAION's 400M samples, 21M passed the first condition. The second condition ensures the lemma as found in the LAION sample has the intended meaning. To quantify the similarity of the LAION text and a synset, we first create a textual representation for the synset by concatenating its name and definition (to be called the synset text).
We then calculate the embedding vectors for both the synset text and the LAION text using CLIP and compute their cosine similarity. Alternatively, one may use any sufficiently powerful text encoder for this purpose. For instance, we repeat this process using MPNet (Song et al., 2020) in Appendix A.

Figure 3a illustrates the distribution of LAION text to synset text similarities. In general, a high value for textual similarity ensures the LAION text is describing the same object as the synset. But as Figure 3b shows, we cannot set a very high similarity threshold, since the extracted dataset would lose its coverage of ImageNet's 1k classes. We found the threshold of 0.82 to be the highest reasonable choice, as it allows for covering most classes, while going beyond it sharply reduces the number of covered classes (Figure 3b) with no significant reduction in the dataset size (Figure 3c). To further support this choice, in Section 4 (Figure 11b), we demonstrate that using the restored captions of ImageNet, a textual similarity above 0.7 is sufficient to ensure that a sample belongs uniquely to the synset. Also refer to Appendix C for an example of when the second step of filtering is necessary and why the chosen threshold is conservative.

Figure 3: Filtering LAION samples based on their textual (text-to-[name: def] CLIP) similarity to the candidate synsets. The dashed line shows the chosen threshold. (a) The overall probability density function (pdf) of the similarities prior to the second step of filtering. (b and c) The number of ImageNet classes covered by at least one example in the dataset and the size of the dataset for different levels of the similarity threshold.
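The resulting two-stage selection rule can be sketched as follows. Here `caption_emb` and `synset_emb` stand in for CLIP text embeddings of the caption and of the concatenated "[name]: [definition]" string, and the helper names are ours, not the released code's.

```python
import numpy as np

SIM_THRESHOLD = 0.82  # the textual similarity threshold chosen above

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_filter(caption, lemmas, caption_emb, synset_emb):
    # condition 1: an exact lemma of the synset appears in the caption
    has_lemma = any(l.replace("_", " ").lower() in caption.lower() for l in lemmas)
    # condition 2: the caption embedding is close to the synset-text embedding
    return has_lemma and cosine_sim(caption_emb, synset_emb) >= SIM_THRESHOLD
```

Condition 2 is what disambiguates word senses: a caption may contain the lemma but describe something unrelated to the synset's definition.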
Figure 4: Relative frequencies of different classes in LAIONet, sorted in descending order, for the 500 most frequent classes. Some class names are shown (e.g., giant_panda, digital_watch, little_blue_heron, hermit_crab). The red line shows uniform weight.

We take a few additional measures to guarantee the safety and quality of the chosen instances. First, we drop samples with more than one label to simplify evaluation on the dataset. Second, we drop images tagged as not-safe-for-work in LAION. Finally, we exclude images that contain text matching the name of their synset. This ensures the captions describe an object in the image and are not merely reflecting on other text. To achieve this, we employ EAST for text detection (Zhou et al., 2017) and TrOCR for text recognition (Li et al., 2023a). This step eliminates 1.1% of the samples.

The final dataset, which we call LAIONet, consists of 822k samples from 915 ImageNet classes, sufficiently large for fine-grained evaluation purposes at statistical significance. Unlike ImageNet, which provides about the same number of images per class, the large variation in the relative frequency of the classes in LAIONet reflects the natural distribution of each class (Figure 4). In particular, 95% of LAIONet classes have at least 6 samples, 90% have at least 13 samples, 80% have at least 36 samples, and 70% have at least 75 samples. We will later use the relative frequency of classes to compare the performance of models on frequent and infrequent classes. Note that we can also create a more conservative version of LAIONet mimicking the ImageNet validation set by retaining only the top 50 most similar instances for each class. This version of LAIONet yields consistent observations in general (Appendix B). Find sample images of LAIONet in Appendix I.

Are LAIONet images harder to classify? To find out, we compare CLIP zero-shot accuracy on LAIONet and ImageNet.
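A sketch of the zero-shot prediction rule used for this comparison, with placeholder arrays standing in for CLIP's image and synset-text embeddings:

```python
import numpy as np

def zero_shot_predict(image_embs, synset_text_embs):
    # normalize rows, then assign each image the synset whose text
    # embedding has the highest cosine similarity with the image embedding
    I = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    T = synset_text_embs / np.linalg.norm(synset_text_embs, axis=1, keepdims=True)
    return np.argmax(I @ T.T, axis=1)
```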
For every image, we predict the label based on which synset has the highest cosine similarity between the image embedding and the synset text embedding. To make accuracy estimates on LAIONet comparable with ImageNet, we calculate accuracy as the average recall across the classes present in LAIONet. This uniform weighting is consistent with the setup of the ImageNet validation set with 50 images per class. We found CLIP zero-shot top-1 accuracy to differ by only 2% across the datasets. Hence, at least from the CLIP view, LAIONet images are not harder to classify in terms of average accuracy.

We acknowledge a limitation in that the CLIP text embeddings used for LAIONet creation are jointly trained with the image embeddings used for zero-shot accuracy calculations. This may give CLIP an advantage on LAIONet, and CLIP zero-shot accuracy should be interpreted with caution. Appendix D offers a more direct assessment of the difficulty of identifying the intended object in LAIONet, achieved by directly computing the cross-modality similarity between an image and its associated synset. Overall, LAIONet images do not exhibit significant difficulty compared to ImageNet.

3 LAIONet Versus ImageNet

We begin to understand the differences between the two datasets by looking at the accuracy of various ImageNet classifiers on LAIONet. After observing a significant accuracy drop, we consider the disparity in intra-class similarity as a possible explanation.

3.1 Comparing Accuracy

We consider four model families: ResNet (He et al., 2016), Vision Transformers (ViT) (Dosovitskiy et al., 2021), modernized ConvNets (ConvNeXt) (Liu et al., 2022), and Bidirectional Encoder representation from Image Transformers (BEiT) (Bao et al., 2022). All models are trained on ImageNet without extra training data.
We use various versions of each model in terms of size (small, base, large, etc.), image resolution (224x224 or 384x384), patch resolution (16x16 or 32x32), and whether the models are pre-trained on the complete ImageNet with 22k classes or not. All models come from Hugging Face (huggingface.co) checkpoints.

We first compare the (equally weighted) accuracy, defined as the average of recalls across the classes covered by LAIONet. Figure 1 compares top-1 and top-5 accuracy on ImageNet and LAIONet. For most of the highly accurate models, accuracy drops by at least 10 percentage points when estimated on LAIONet, with models pre-trained on ImageNet-22k showing slightly more robustness.

Next, we use the relative frequency of each class in LAIONet to weight its recall and obtain a LAION-weighted accuracy. Figure 5 compares LAION-weighted and equally-weighted accuracy on LAIONet. The LAION-weighted accuracy is consistently lower by 5 to 10 percentage points. This can partially be explained by the observation that ImageNet-trained models perform worse when the class describes a more common object (Appendix G.1). We also compared LAION-weighted and equally-weighted accuracy on ImageNet and found consistent results in Appendix F.

3.2 Comparing Intra-Class Similarity

While LAIONet images are in a precise sense not more difficult than ImageNet's, there is another factor that can explain the accuracy drop: the intra-class similarity of images. We define this similarity as the pairwise similarity of images from the same class, measured by the cosine similarity of their CLIP image embeddings. The lower these similarity values, the more diverse the images from that class. Figure 6a shows the distribution of intra-class similarities aggregated over all the classes that have at least 7 samples in LAIONet. To make the distributions comparable, we sampled (with replacement) the similarities from LAIONet to match ImageNet.
The left tail of the LAIONet intra-class similarity distribution makes it clear that LAIONet overall provides a more diverse set of images. To observe the effect in greater detail, for each class, Figure 6b shows the average intra-class similarity of LAIONet images minus the average intra-class similarity of ImageNet images from the same class. In almost two-thirds of the classes, LAIONet has significantly lower intra-class similarity. This provides further evidence that LAIONet images exhibit greater variability within each class.

Figure 5: A LAION-weighted accuracy (top-1 and top-5) is calculated according to the relative frequency of the classes in LAIONet and compared to the accuracy with equally weighted classes.

Figure 6: Comparing the intra-class similarity of LAIONet and ImageNet. (a) In each class, pairwise similarities of LAIONet images are sampled to match ImageNet in number. All classes combined, the distribution of intra-class similarity is depicted. (b) For each class, the average intra-class similarity of ImageNet images is subtracted from the same value in LAIONet. The blue and red curves show upper and lower 95% confidence bounds (UCB and LCB). All values are sorted ascendingly.

In Appendix G.2, we show that models struggle more with classes where LAIONet and ImageNet have significantly different intra-class similarity.
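Concretely, the intra-class similarity of one class can be computed as the mean pairwise cosine similarity of its image embeddings; a minimal sketch, assuming the embeddings come from an image encoder such as CLIP's:

```python
import numpy as np

def intra_class_similarity(class_embs):
    # mean cosine similarity over all distinct pairs of images in one class
    E = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = E @ E.T
    iu = np.triu_indices(len(E), k=1)  # upper triangle: count each pair once
    return float(sims[iu].mean())
```

Lower values mean a more diverse class; this is the per-class quantity averaged and compared across datasets in Figure 6.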
This, combined with our observation that LAIONet has lower intra-class similarity, supports our argument that intra-class similarity plays a crucial role in the reduced accuracy.

4 Diagnosing ImageNet

As is standard modeling practice, we think of a data-generating process that for a given class Y generates a pair of image X and text T. Ideally, when we search for images of a particular class y, we would like to draw samples from the distribution p(X|Y = y). Unless we have access to the generative process, or we have a completely random set of images all correctly labeled, drawing samples directly from p(X|Y = y) will not be possible. In particular, none of these options is available when researchers collect a new dataset. Instead, researchers have to define a selection mechanism S for choosing images. What we observe is the conditional distribution of X given S.

In creating LAIONet, we relied on texts to select the samples (Figure 2a). LAIONet images follow p(X|S = 1), where S = 1 if T is sufficiently similar to Y. With our conservative selection criteria, we can assume every T that passed our similarity threshold is generated from the intended Y = y. Therefore, p(X|S = 1) = p(X|S = 1, Y = y). Generally, an image carries much more information than the text. So, for the images of a certain class, conditioning on the text alone should not alter the distribution significantly. Intuitively speaking, p(X|Y = y, T = t) ≈ p(X|Y = y). In our setting, a weaker independence is sufficient to show LAIONet images follow the desired distribution: even if information from X beyond Y is present in T, since we deliberately refrained from searching for visual descriptions in the text, we expect S to be independent of X given Y = y. Hence, we have reason to hope p(X|S = 1) = p(X|S = 1, Y = y) ≈ p(X|Y = y).

In general, a selection S′ can rely on both the text and the image directly (Figure 2b).
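The bottleneck argument can be illustrated with a toy simulation (ours, not from the paper): within one class, let a one-dimensional image feature x be summarized by a noisy caption feature t. Selecting "prototypical" examples directly on x collapses the image distribution, while selecting on t at the same selection rate barely changes it, because t carries little information about x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)             # image feature within one class
t = x + 3.0 * rng.normal(size=n)   # caption: a noisy, low-information summary

# select "prototypical" examples (close to the class prototype at 0),
# matching the selection rate of the two rules
sel_image = np.abs(x) < 0.5
sel_text = np.abs(t) < np.quantile(np.abs(t), sel_image.mean())

var_all, var_text, var_image = x.var(), x[sel_text].var(), x[sel_image].var()
# image-based selection shrinks the variance of x drastically;
# text-based selection leaves it almost intact
```

The remaining variance plays the role of intra-class diversity here: conditioning selection on the richer modality is what destroys it.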
In this case, the distribution of observed images p(X|S′ = 1) can be far from the desired distribution p(X|Y = y). We believe this has happened in the collection of ImageNet, primarily through human annotators examining and acting on images. Incorporation of visual features on the side of the search engine provider is another plausible mechanism. While we may not be able to pinpoint the exact mechanism at play, we will now move beyond LAIONet and demonstrate, through three independent experiments, a strong link between the image X and the selection criterion S′ in the creation of ImageNet.

4.1 A Weaker Image-To-Selection Link Makes ImageNet More Like LAIONet

Image annotation is one clear mechanism by which the image X influences selection S′. Changing the strictness of annotation allows us to modulate the strength of this mechanism and measure its effect. This experiment is possible due to the availability of ImageNet V2 (Recht et al., 2019), which comes in three different versions. The three versions of ImageNet V2, called a, b, and c, differ in the level of agreement among annotators. More precisely, each image comes with an MTurk selection frequency: the fraction of MTurk workers who selected the image as belonging to the target class. ImageNet V2 versions a, b, and c have an average MTurk selection frequency of 0.85, 0.73, and 0.93, respectively; version b has the lowest and version c the highest selection frequency.

We first observe that allowing for more disagreement among annotators results in the inclusion of more diverse images. Figure 7a shows the distribution of intra-class similarity for ImageNet V2 versions b and c. One can see that in version b, with the lowest average MTurk selection frequency, the intra-class similarity is shifted toward lower values. We next show that as the average MTurk selection frequency increases, ImageNet V2 becomes more similar to ImageNet and less similar to LAIONet.
In this regard, to compare two datasets, we count the number of classes in which the first dataset has significantly lower intra-class similarity than the second dataset, and vice versa. Figure 7b compares LAIONet and the three versions of ImageNet V2. As the figure suggests, LAIONet and ImageNet V2 are similarly diverse when the average MTurk selection frequency is low (corresponding to ImageNet V2 version b). However, as the MTurk selection frequency increases, ImageNet V2 shows higher intra-class similarity than LAIONet. At the same time, Figure 7c shows ImageNet V2 becomes more similar to ImageNet as we increase the MTurk selection frequency.

Figure 7: The effect of MTurk selection frequency on intra-class similarity. (a) The distribution of intra-class similarity aggregated over all classes for ImageNet V2 versions b and c. (b) LAIONet versus the three versions of ImageNet V2. The vertical axis shows the proportion of classes in which one dataset has significantly lower intra-class similarity than the other. To determine whether intra-class similarity is significantly lower in a class, we compute the average intra-class similarity of images within that class for both datasets and compare their difference using a 95% bootstrapped confidence interval. Blue curve: the proportion of classes where LAIONet has lower intra-class similarity than a specific version of ImageNet V2. Green curve: ImageNet V2 has lower intra-class similarity. (c) ImageNet versus ImageNet V2. Red curve: ImageNet has lower intra-class similarity. Green curve: ImageNet V2 has lower intra-class similarity.

We also contrast the accuracy of ImageNet-trained models on LAIONet and ImageNet V2 in Figure 8. Reaffirming our previous observation, as the MTurk selection frequency declines from version c to a to b, the accuracy on ImageNet V2 and LAIONet increasingly aligns. Notably, most models experience similar accuracy drops on ImageNet V2-b and LAIONet.

Figure 8: Accuracy (top-1 and top-5) of ImageNet-trained models evaluated on the three versions of ImageNet V2 versus LAIONet (equi-weighted classes).

Together, these observations show that the impact the image has on selection, particularly during annotation, is significant and can partially explain the divergence between LAIONet and ImageNet. Further, the extra intra-class diversity of LAIONet is achievable with less stringent human annotation, which can explain the consistent accuracy drops on LAIONet and ImageNet V2-b.

4.2 Introducing an Image-To-Selection Link Makes LAIONet More Like ImageNet

We deliberately refused to use visual information in selecting LAIONet samples. Complementing Section 4.1, a natural question arises: what would happen if we used multimodal LAION image-to-synset text similarity, either in place of or alongside LAION text-to-synset text similarity, to select samples? To answer this question, we construct new datasets out of LAION where we require an included sample to have image-to-[name: def] similarity greater than one threshold and text-to-[name: def] similarity greater than another threshold. To choose these two thresholds, we control the total number of selected samples to be similar to our original version of LAIONet. Using CLIP embeddings to calculate similarities, we derive a multimodal similarity threshold for each value of the textual similarity threshold, as shown in Figure 9.
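The threshold matching just described amounts to a quantile computation: among samples passing a given textual threshold, choose the image-similarity cutoff that retains a target number of samples. A sketch with hypothetical inputs (function name is ours):

```python
import numpy as np

def matched_multimodal_threshold(text_sims, image_sims, text_thr, target_size):
    # restrict to samples passing the textual threshold, then choose the
    # image-to-[name: def] cutoff keeping exactly `target_size` of them
    pool = image_sims[text_sims >= text_thr]
    return np.sort(pool)[::-1][target_size - 1]
```

Sweeping `text_thr` and re-solving for the image cutoff traces out a curve of equal-size datasets, from text-dominated selection at one end to image-dominated selection at the other.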
From this curve, we select four distinct threshold values and create a dataset for each. We call these datasets A, B, C, and D, as depicted in Figure 9. For all four datasets, we calculate the intra-class similarities of all the classes they have in common. We show the distribution of intra-class similarities of datasets A, B, and D in Figure 10a. As the figure suggests, dataset A, which is the most similar to LAIONet, has the lowest intra-class similarities overall. As we strengthen the image-to-selection link by using a lower textual similarity threshold and a higher multimodal similarity threshold, the datasets show higher intra-class similarity. This is also evident in Figure 10b, where we plot the average of intra-class similarities across classes. In Figure 10c, we also look into the number of classes where a dataset has lower intra-class similarity compared to ImageNet. Figures 10b and 10c together show that dataset A, the most similar to LAIONet, exhibits greater diversity than ImageNet, both in terms of average intra-class similarity and the number of classes where A is more diverse. However, as we increase the multimodal threshold, the datasets consistently become less diverse, eventually even less diverse than ImageNet. To make sure our measurement of intra-class similarities is not biased by the CLIP visual encoder, we repeat this experiment using a variety of encoders in Appendix H and find even stronger results.

Figure 9: For each textual similarity threshold, we find the corresponding multimodal similarity threshold such that the resulting dataset, satisfying both similarity criteria, contains the same number of samples as LAIONet. We then choose four points on this curve, A, B, C, and D, and create four such datasets.
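The per-class significance test used in these comparisons (described in the Figure 7 caption) can be sketched as follows. This is a minimal illustration assuming CLIP image embeddings are precomputed per class; the function names are ours:

```python
import numpy as np

def mean_intra_class_similarity(embeddings):
    """Average pairwise cosine similarity among one class's image embeddings."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    iu = np.triu_indices(len(e), k=1)  # distinct pairs only
    return sims[iu].mean()

def significantly_lower(emb_a, emb_b, n_boot=1000, seed=0):
    """True if dataset A has significantly lower intra-class similarity than
    dataset B in this class, via a 95% bootstrapped CI on the difference."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        a = emb_a[rng.integers(0, len(emb_a), len(emb_a))]  # resample A
        b = emb_b[rng.integers(0, len(emb_b), len(emb_b))]  # resample B
        diffs.append(mean_intra_class_similarity(a) - mean_intra_class_similarity(b))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return bool(hi < 0)  # the entire 95% CI lies below zero
```

Counting the classes for which this returns true, in each direction, gives the proportions plotted on the vertical axes of Figures 7b, 7c, and 10c.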
In summary, in this experiment, we used CLIP multimodal similarity to introduce an image-to-selection link into our dataset creation process. Our observations strongly suggest that this link can reduce the diversity of the included images when we control the total number of selected examples. While CLIP multimodal similarity, which is known to suffer from specific biases (Tong et al., 2024), may not accurately reflect how the original image-to-selection link worked in ImageNet, it demonstrates how relying on a richer modality can make the selection process more prone to bias.

Figure 10: (a) The distribution of intra-class similarity aggregated over all classes for datasets A, B, and D. (b) The average of intra-class similarities compared across datasets A, B, C, and D. To calculate this average, we first find the average intra-class similarity in each class and then take the mean across all common classes of A, B, C, and D. (c) The proportion of classes where the new datasets have lower intra-class similarities than ImageNet.

4.3 Text Alone Cannot Explain Why an Image Is Selected Into ImageNet

ImageNet-Captions (Fang et al., 2022) is a subset of the ImageNet-1k training data with restored title, description, and tags from Flickr. We assume the samples in ImageNet-Captions are a random subset of the original ImageNet and that the captions are accurately restored. If there were no link X → S, the accompanying caption of an image in ImageNet-Captions should be able to explain why this image was selected. We follow Fang et al. (2022) and define the text as the title, description, and tags concatenated. Figure 11a illustrates the similarity between the texts and their respective synsets using CLIP text embeddings.
Although most of the texts have a high similarity of 0.6 or above to their synsets, the distribution has a heavy left tail. The fact that a text has low similarity to the intended synset does not necessarily mean it could not have been chosen by the search engine. However, we show that many of the texts with low similarity to the intended synsets actually have high similarity to numerous other synsets, making them less likely to have appeared for the intended meaning. For every text, we find the similarity to all synsets, i.e., the similarity to their names and definitions, and count the proportion of unintended synsets (false classes) that are more similar to the text than the intended synset. A low value for this proportion shows that the text represents its intended synset well, whereas a substantially non-zero value indicates that there are many other synsets more strongly present in the text. As Figure 11b demonstrates, for a text with low similarity to its synset there are on average 5% (equivalently, 200) or more other synsets more similar to the text. These observations show that, at least based on the restored texts in ImageNet-Captions, text alone cannot fully explain why an image was selected, and another mechanism must have been at play.

Figure 11: (a) The distribution of text-to-synset similarity for LAIONet and ImageNet-Captions. (b) For every bin of text-to-synset similarity, the average proportion of unintended classes that are more similar to the text than the intended class is depicted in black.
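The false-class proportion behind Figure 11b can be sketched in a few lines. This is an illustrative version operating on precomputed, L2-normalized CLIP text embeddings; the names are ours:

```python
import numpy as np

def false_class_proportion(text_emb, synset_embs, intended):
    """Fraction of unintended synsets whose embedding is more similar to the
    text than the intended synset's embedding (the quantity in Figure 11b).
    `synset_embs` stacks one normalized embedding per synset, row-wise."""
    sims = synset_embs @ text_emb        # cosine similarity to every synset
    higher = sims > sims[intended]       # unintended synsets beating the label
    return higher.sum() / (len(sims) - 1)
```

A value of 0 means the intended synset is the single best textual match; values well above 0 flag captions that more plausibly refer to some other class.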
4.4 ImageNet, Had It Been Created Solely by Searching Texts, Does Not Resemble Current ImageNet

If the link from X to S did not exist, regardless of how the selection algorithm works, p(X | T = t) would look the same in both graphs of Figure 2. To test this hypothesis, we extract a new dataset from LAION. For every image in ImageNet with corresponding text T = t in ImageNet-Captions, we find the LAION sample with the text most similar to t. We only keep a LAION sample if the similarity is above 0.7. This choice ensures the two texts are sufficiently similar that we can consider them roughly the same, while the dataset covers more than 95% of the ImageNet classes (Appendix E).

Figure 12: Comparing the intra-class similarity of the new dataset and ImageNet. The new dataset is obtained by selecting LAION examples with the most similar texts to the texts in ImageNet-Captions. (a) Distribution of intra-class similarity aggregated across all classes. In each class, pairwise similarities of the images in the new dataset are sampled to match ImageNet in number, to make the distributions comparable. (b) For each class, the average intra-class similarity of the images in the new dataset minus the corresponding value in ImageNet is plotted in black. The upper and lower 95% confidence bounds are depicted in blue and red. All values are sorted in ascending order.

As Figure 12a suggests, images in the new dataset have significantly lower intra-class similarity. Looking at each class separately, Figure 12b shows that in almost 70% of the classes, the images from the new dataset are significantly more diverse (have lower intra-class similarity).
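The caption-matching step above admits a compact sketch. This assumes precomputed, L2-normalized text embeddings for both corpora (in practice an approximate nearest-neighbor index would replace the dense product); the names are ours:

```python
import numpy as np

def match_captions(imagenet_text_embs, laion_text_embs, min_sim=0.7):
    """For each ImageNet-Captions text, pick the LAION sample with the most
    similar text, keeping only matches with similarity >= min_sim.
    Returns matched LAION indices and the kept ImageNet-Captions indices."""
    sims = imagenet_text_embs @ laion_text_embs.T   # cosine similarity matrix
    best = sims.argmax(axis=1)                      # nearest LAION text per query
    best_sim = sims[np.arange(len(sims)), best]
    keep = best_sim >= min_sim                      # drop weak matches
    return best[keep], np.flatnonzero(keep)
```

The 0.7 cutoff trades match fidelity against class coverage, which is exactly the tension examined in Appendix E.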
These observations reject the hypothesis that the graphs of Figure 2 have the same structure and point to a leak from the image to the selection. We note the limitation that texts in the ImageNet-Captions dataset may not completely match the text available at the time of ImageNet's creation. Second, in many cases we were unable to find close matches for the ImageNet texts in LAION-400M; scaling our analysis to LAION-5B might help here.

5 Conclusion

In conclusion, we argue that the image-to-selection mechanism played a significant role in the creation of ImageNet, distinguishing it from LAION. We demonstrated this through three experiments. First, we modulated the speculated link from image to selection, showing the significant contribution this mechanism makes to reducing the diversity of the selected images. The next two experiments rejected the hypothesis that the image plays no or a negligible role in the selection, by showing that ImageNet captions alone cannot explain the selection. This insight carries valuable implications for dataset creation efforts in general. When developing a new dataset where diversity is desired, we advise selecting candidate instances based on an information bottleneck, such as a succinct textual description of the instance, rather than the full instance. This mitigates the selection bias that may otherwise distort the distribution of data conditional on selection.

References

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=p-BhZSz59o4.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, and Aleksander Madry. Identifying statistical bias in dataset replication. In International Conference on Machine Learning, pp. 2922–2932. PMLR, 2020.

Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language-image pre-training (CLIP). In International Conference on Machine Learning, pp. 6216–6234. PMLR, 2022.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

Badr Youbi Idrissi, Diane Bouchacourt, Randall Balestriero, Ivan Evtimov, Caner Hazirbas, Nicolas Ballas, Pascal Vincent, Michal Drozdzal, David Lopez-Paz, and Mark Ibrahim. ImageNet-X: Understanding model mistakes with factor of variation annotations. In The Eleventh International Conference on Learning Representations, 2023.
URL https://openreview.net/forum?id=HXz7Vcm3VgM.

Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 13094–13102, 2023a.

Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, and Mark Ibrahim. A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20071–20082, 2023b.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.

George A Miller. WordNet: An electronic lexical database. MIT Press, 1998.

Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Quality not quantity: On the interaction between dataset design and robustness of CLIP. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=LTCBavFWp5C.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning.
In International Conference on Machine Learning, pp. 459–466, 2012.

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=M3Y74vmsMcY.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9568–9578, 2024.

Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pp. 1521–1528. IEEE, 2011.

Felix Vogel, Nina Shvetsova, Leonid Karlinsky, and Hilde Kuehne. VL-Taboo: An analysis of attribute-based zero-shot capabilities of vision-language models. CoRR, abs/2209.06103, 2022. URL https://doi.org/10.48550/arXiv.2209.06103.

Kai Yuanqing Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=gl3D-xY7wLq.
Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560, 2017.

A An MPNet-Filtered LAIONet

The creation of LAIONet relies on the textual similarity of the LAION text and the synset text. In Section 2 we used the cosine similarity of CLIP text embeddings to calculate this similarity; however, any sufficiently strong text encoder can be used for this purpose. In particular, we use MPNet (Song et al., 2020) fine-tuned on 1B sentence pairs with a contrastive objective by Hugging Face.2 We follow a procedure similar to Section 2 and choose the maximum similarity threshold such that the resulting dataset does not lose its coverage over classes. We select a similarity threshold of 0.58. As Figure 13 suggests, a threshold larger than 0.58 may exclude many classes without reducing the size of the resulting dataset. Refer to Appendix C for additional evidence that this threshold works.

Figure 13: Filtering LAION samples based on their MPNet textual similarity to the candidate synsets. The dashed line shows the chosen threshold. (a) The overall distribution of the similarities prior to the second step of filtering. (b and c) The number of ImageNet classes covered by the dataset and the size of the dataset for different levels of the similarity threshold.

Proceeding with the similarity threshold of 0.58, and after dropping samples labeled as not-safe-for-work, samples with multiple labels, and images containing the text of their associated synsets, this version of LAIONet has 831k samples covering 938 classes.
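The threshold choice above can be sketched as a simple sweep over candidate thresholds, tracking dataset size and class coverage. This is an illustrative version (names are ours) that assumes the MPNet text-to-synset similarities and matched class labels are already computed:

```python
import numpy as np

def coverage_curve(sims, labels, thresholds):
    """For each candidate threshold, count surviving samples and covered
    classes: the trade-off plotted in Figure 13 (b and c)."""
    sizes, covered = [], []
    for t in thresholds:
        keep = sims >= t
        sizes.append(int(keep.sum()))
        covered.append(len(np.unique(labels[keep])))
    return np.array(sizes), np.array(covered)

def largest_safe_threshold(sims, labels, thresholds):
    """Largest threshold that still attains maximal class coverage."""
    _, covered = coverage_curve(sims, labels, thresholds)
    return thresholds[covered == covered.max()].max()
```

The chosen 0.58 corresponds to the last point on this curve before class coverage starts to drop.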
As Figure 14 shows, consistent with our observation from CLIP-filtered LAIONet, models trained on ImageNet experience a 10- to 15-percentage-point accuracy drop on MPNet-filtered LAIONet.

Figure 14: Accuracy of ImageNet-trained models when evaluated on the ImageNet validation set versus MPNet-filtered LAIONet (Top-1 and Top-5, classes equi-weighted). Three types of models are distinguished based on whether they are pre-trained on ImageNet-22k and whether they are fine-tuned on ImageNet-1k. Accuracy is defined as the average of the recalls calculated for each class that is present in LAIONet.

2 https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Last but not least, Figure 15 suggests that MPNet-filtered LAIONet also exhibits lower intra-class similarity compared to ImageNet. In particular, in more than 70% of the classes, LAIONet has significantly lower intra-class similarity than ImageNet.

Figure 15: Comparing the intra-class similarity of LAIONet and ImageNet. (a) In each class, pairwise similarities of LAIONet images are sampled to match ImageNet in number. With all classes combined, the distribution of intra-class similarity is depicted. (b) For each class, the average intra-class similarity of ImageNet images is subtracted from the same value in LAIONet. The blue and red curves show the upper and lower 95% confidence bounds. All values are sorted in ascending order.
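The accuracy notion defined in the Figure 14 caption, the unweighted mean of per-class recalls, can be sketched as follows (an illustrative implementation; the names are ours):

```python
import numpy as np

def class_recalls(y_true, y_pred):
    """Per-class recall: the fraction of each class's samples predicted correctly."""
    recalls = {}
    for c in np.unique(y_true):
        mask = y_true == c
        recalls[c] = (y_pred[mask] == c).mean()
    return recalls

def equi_weighted_accuracy(y_true, y_pred):
    """Accuracy as the unweighted mean of per-class recalls, so that rare and
    frequent classes count equally (the equi-weighted metric in our figures)."""
    return float(np.mean(list(class_recalls(y_true, y_pred).values())))
```

Averaging recalls rather than raw predictions keeps the metric insensitive to LAIONet's long-tailed class frequencies.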
B A LAIONet From the Most Similar Instances

We created LAIONet by ensuring the presence of at least one lemma from the associated synset in the LAION text and by ensuring sufficient similarity between the synset text and the LAION text. The frequency of each class in LAIONet reflects the natural distribution of that class on the web and, likely, worldwide. However, we can create a more conservative version of LAIONet by retaining only the top 50 most similar instances for each class. This makes LAIONet more similar to the ImageNet validation set. Such a version of LAIONet has 39k samples covering 915 classes if initially filtered with a CLIP similarity threshold of 0.82, and 41k samples covering 938 classes if initially filtered with an MPNet similarity threshold of 0.58. Figure 16 illustrates that models performing well on ImageNet consistently experience a 7- to 10-percentage-point reduction in accuracy on this version of LAIONet. Hence, the reduction in accuracy is consistent across all versions of LAIONet, including the most conservatively created ones. Figure 17 also confirms that this version of LAIONet still exhibits a longer tail of small intra-class similarity compared to ImageNet, potentially explaining the accuracy drop.

C On the Choice of the LAION Text to Synset Text Similarity Threshold

In Section 2, we described how LAIONet is generated through substring matching of LAION texts with ImageNet synset lemmas, followed by filtering out the cases where the LAION text is not sufficiently similar to the synset name and definition. A critical choice in the second filtering step is the minimum required textual similarity. We conservatively chose this threshold to be the largest value such that the remaining examples cover a large number of ImageNet's classes. To show that this filtering is necessary and that our threshold of 0.82 for CLIP-based filtering and 0.58 for MPNet-based filtering is conservative, we provide an example in Figure 18. Here the synset "cougar" has the lemma "puma".
From the WordNet definition, a cougar is a "large American feline resembling a lion". But the common usage of "puma" on the web refers to a brand. As Figure 18 shows, for small similarity to the synset, the data will most likely represent the brand instead of the animal. As we increase the similarity threshold, the examples become more and more likely to carry the intended meaning. Our manual inspections show that, as in this example, the chosen thresholds most likely result in high-quality matching to the intended meaning of the synset even when the web is dominated by other meanings.

Figure 16: Accuracy of ImageNet-trained models when evaluated on the ImageNet validation set versus LAIONet created by retaining the top 50 most similar instances for each class. (a, b) Top-1 and Top-5 accuracy for the CLIP-filtered top 50; (c, d) Top-1 and Top-5 accuracy for the MPNet-filtered top 50.

Figure 17: Comparing the intra-class similarity of LAIONet and ImageNet for (a) the CLIP-filtered top 50 and (b) the MPNet-filtered top 50. In each class, pairwise similarities of LAIONet images are sampled to match ImageNet in number. With all classes combined, the distribution of intra-class similarity is depicted.
LAIONet is created by retaining the top 50 instances most similar to the synset text in each class, after textual similarity filtering with CLIP or MPNet.

Figure 18: Sample images from five intervals of LAION text to synset text similarity, for (a) CLIP-based and (b) MPNet-based textual similarity.

D On the (Non)Difficulty of LAIONet Image Classification

To obtain a better idea of how hard it is to recognize an object in LAIONet, we calculate the cross-modal similarity of the images to the texts of their associated synsets using CLIP embeddings. A high value of image-to-synset similarity indicates that CLIP is able to identify an object from the synset in the image. On the other hand, a low value could indicate that the intended object is either absent from the image or difficult to recognize. We compare the image-to-synset similarities obtained from the ImageNet validation set and LAIONet. Figure 19a illustrates the distribution of image-to-synset similarity for LAIONet and ImageNet. To ensure these distributions are comparable, we sampled LAIONet with replacement to match the number of images per class in the ImageNet validation set. As the figure suggests, the two datasets are not significantly different. In a more fine-grained test, we compared the image-to-synset similarity of LAIONet and ImageNet for each class. Figure 19b shows the average similarity in each class for LAIONet minus the average similarity in the same class for ImageNet, along with the 95% upper and lower confidence bounds. Overall, there is no strong signal that LAIONet images are particularly harder.

Figure 19: Comparing image-to-synset similarities of LAIONet and ImageNet. (a) For each class, LAIONet is sampled with replacement to have the same number of images as ImageNet, and all samples are aggregated to obtain the distribution. (b) For every class, the average similarity of the images to the synset text is calculated for LAIONet and ImageNet and the difference is plotted. The upper and lower 95% confidence bounds for this difference are plotted in red and blue. All values are sorted in ascending order.

E On the Choice of the Textual Similarity Threshold in Extracting the Most Similar LAION Instances to ImageNet-Captions

In Section 4.4, we selected a similarity threshold of 0.7 as the minimum requirement for the similarity between the LAION text and the ImageNet text in order to include a sample from LAION. Ideally, we would look for LAION examples with text identical to ImageNet's, but due to the limited number of samples available in LAION, this is not possible. As Figure 3b shows, increasing the similarity threshold beyond the chosen level of 0.7 significantly decreases the number of covered classes. Meanwhile, for larger thresholds, the new dataset looks more like ImageNet but remains distinguishable. As Figure 20b shows, the proportion of classes with significantly lower intra-class similarity in ImageNet increases as the threshold increases, while the proportion of classes with significantly lower intra-class similarity in the new dataset decreases. The gap persists but could potentially become smaller in the region our data cannot cover. In sum, the new dataset extracted based on ImageNet looks unlike ImageNet, but only to the extent that it is possible to find similar texts in LAION.
Figure 20: The effect of the similarity threshold on the dataset extracted from the LAION samples with texts most similar to the ImageNet texts. (a) The number of classes covered in the new dataset versus the similarity threshold. (b) The proportion of classes with significantly lower intra-class similarity in the new dataset (blue) and in ImageNet (red) versus the similarity threshold.

F LAION-Weighted Versus Equally-Weighted Accuracy Evaluated on ImageNet

In Section 3.1 we introduced LAION-weighted accuracy, where we use the relative frequency of each class in LAIONet to weight its recall. As we presented in Figure 5, the LAION-weighted accuracy is consistently lower than the equally-weighted accuracy when models are evaluated on LAIONet. This observation is not limited to evaluation on LAIONet. In fact, Figure 21 shows that when we weight the classes according to their relative frequency in LAIONet, ImageNet accuracy also decreases. This can be attributed to the challenge of recognizing more frequent objects, given their potentially diverse types.

G The Relation of Recall, Relative Frequency, and Intra-Class Similarity

G.1 Recall Versus Relative Frequency

In Section 3.1 we observed that accuracy drops when we weight classes according to their frequency in LAIONet. This can be partially explained by models performing worse on more frequent classes. To observe this directly, Figure 22 shows the recall in each class versus the relative frequency of the class in LAIONet. Regardless of whether LAIONet is created by filtering based on CLIP textual similarity or MPNet similarity, there is a weak but consistent trend that more frequent classes are more likely to be misclassified.
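The LAION-weighted metric from Appendix F is a one-line reweighting of per-class recalls. As an illustrative sketch (names are ours), with recalls and LAIONet class counts given as aligned arrays:

```python
import numpy as np

def laion_weighted_accuracy(recalls, class_counts):
    """Weight each class's recall by that class's relative frequency in
    LAIONet, instead of weighting all classes equally (Appendix F)."""
    weights = class_counts / class_counts.sum()  # relative class frequencies
    return float((weights * recalls).sum())
```

Because frequent classes tend to have lower recall, this weighted average falls below the equally-weighted one, matching the gap seen in Figures 5 and 21.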
Figure 21: On ImageNet, a LAION-weighted accuracy (Top-1 and Top-5), calculated according to the relative frequency of the classes in LAIONet, is compared to the accuracy with equally weighted classes.

G.2 Recall Versus Intra-Class Similarity

Section 3.2 introduced the hypothesis that higher intra-class similarity may account for the lower-than-expected performance of ImageNet models on LAIONet. To see that intra-class similarity can be responsible for the accuracy drop, Figure 23 demonstrates that models struggle on classes where LAIONet is more diverse than ImageNet, as shown by the recall rates plotted against the difference in average intra-class similarity. This holds regardless of which notion of accuracy and which version of LAIONet, CLIP-filtered or MPNet-filtered, is used.

Figure 22: Recall per class evaluated on LAIONet versus how frequent the class is in LAIONet. Panels: (a) Recall@1, CLIP-filtered; (b) Recall@1, MPNet-filtered; (c) Recall@5, CLIP-filtered; (d) Recall@5, MPNet-filtered; models include convnext-base-224, convnext-base-224-22k-1k, and vit-base-patch16-224.
Four different models are used, two of them pre-trained on ImageNet-21k and two of them not. Two versions of LAIONet, CLIP-filtered and MPNet-filtered, are included. Trends are consistent.

Figure 23: Recall on LAIONet for each class versus the disparity in intra-class similarity between LAIONet and ImageNet. This disparity (horizontal axis) is measured by subtracting the class-average intra-class similarity in ImageNet from that in LAIONet. Panels: (a) Recall@1, CLIP-filtered; (b) Recall@1, MPNet-filtered; (c) Recall@5, CLIP-filtered; (d) Recall@5, MPNet-filtered. Four exemplary models are shown, two of them pre-trained on ImageNet-21k (yellow) and two of them not (red). Two versions of LAIONet are considered. Trends are consistent.

H Using a Variety of Image Encoders to Calculate Intra-Class Similarities

In Section 3.2, we introduced intra-class similarity and used CLIP image embeddings to calculate it.
Since CLIP has well-known visual biases (Tong et al., 2024), we repeat our calculations using a variety of other encoders, taking the last hidden layer of the base ViT, BEiT, and ConvNeXT models, all pre-trained on the larger Image Net-22k dataset. Figure 24 illustrates the proportion of classes where LAIONet has significantly lower (blue) or higher (red) intra-class similarity compared to Image Net. As the figure suggests, CLIP and ViT-based image embeddings yield highly consistent results. Furthermore, using other embeddings only strengthens our argument about the additional diversity in LAIONet.

Figure 24: Comparing the intra-class similarity of LAIONet and Image Net for various encoders (CLIP, ViT, ConvNeXT, BEiT). The vertical axis shows the proportion of classes in which one dataset has significantly lower intra-class similarity than the other, where we define significance at the 95% confidence level.

We also repeat our experiment of Section 4.2 using various image encoders. Recall that in Section 4.2, we generated new datasets A, B, C, and D by utilizing both the multimodal similarity between LAION images and synset text and the textual similarity between LAION text and synset text. We then demonstrated in Figure 10c the fraction of classes where these datasets show lower intra-class similarity than Image Net. In Figure 25, we have repeated this experiment using other image embeddings. With any image encoder, going from dataset A to D, that is, increasing the role of multimodal similarity in dataset creation, the datasets become less and less diverse compared to Image Net. This once again reaffirms our conclusion regarding the potential bias that the image-to-selection link can introduce in dataset creation.
Figure 25: The proportion of classes where the new datasets A, B, C, and D in Section 4.2 have lower intra-class similarities than Image Net, where we use various image encoders (CLIP, ViT, BEiT, ConvNeXT) to calculate intra-class similarity.

I Sample Images From LAIONet

In this section, we provide randomly picked images from both CLIP-filtered and MPNet-filtered LAIONet (Appendix B). These images have been chosen to cover various levels of difficulty. Figure 26 illustrates the distribution of the per-class recall@5 difference of the ViT-base model between LAIONet and Image Net over their common classes. We choose recall@5 as a more reliable metric since the multiplicity of labels is less of a concern. One can see that there exist classes for which the recall on LAIONet is lower than on Image Net by 0.5 or more. These are typically classes for which LAIONet may have used a broader meaning of the synset, or in which the images appear in a different context than in Image Net. It is worth noting that these classes make up a very small portion of all classes and have minimal impact on evaluations, whether or not including such images is desired. For the classes labeled in Figure 26, we provide 10 random images from each dataset in the following figures. Each figure comes with a potential explanation, in its caption, for the failure of Image Net models.

Figure 26: Distribution of recall@5 on LAIONet minus recall@5 on Image Net. Only common classes are considered. The labels mark the chosen classes for which example images are provided; the position of each label on the horizontal axis is the difference in recalls for that class. [Panels: (a) MPNet-filtered, labeled classes: egyptian cat, pillow, hen, thimble, torch, bolete; (b) CLIP-filtered, labeled classes: brass, grille, egyptian cat, pillow.]
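The per-class recall@5 differences underlying Figure 26 can be sketched as follows. This is an illustrative sketch, not the authors' code; it assumes integer class labels and precomputed top-k predictions, and all variable names here are hypothetical:

```python
import numpy as np

def recall_at_k_per_class(y_true: np.ndarray, topk_preds: np.ndarray) -> dict:
    """Per-class recall@k: the fraction of a class's images whose true
    label appears among the top-k predicted labels.

    y_true:     shape (n,), integer class labels
    topk_preds: shape (n, k), top-k predicted labels per image
    """
    hits = (topk_preds == y_true[:, None]).any(axis=1)
    return {int(c): float(hits[y_true == c].mean()) for c in np.unique(y_true)}

def recall_difference(recalls_a: dict, recalls_b: dict) -> dict:
    """Recall on dataset A minus recall on dataset B, over common classes."""
    common = recalls_a.keys() & recalls_b.keys()
    return {c: recalls_a[c] - recalls_b[c] for c in common}
```

Computing `recall_difference` with LAIONet as A and Image Net as B, a strongly negative value for a class indicates the kind of failure illustrated in the figures below.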
Figure 27: Egyptian cat. Image Net models primarily struggle with Egyptian cat statues or painted graphics, which are not well represented or are rare in the Image Net dataset. [Panels: (a) MPNet-filtered, predicted synsets include letter_opener.n.01, saltshaker.n.01, plate_rack.n.01, comic_book.n.01; (b) CLIP-filtered, predicted synsets include milk_can.n.01, table_lamp.n.01, pedestal.n.03, saltshaker.n.01; (c) Image Net validation.]

Figure 28: Pillow. Image Net models struggle to identify pillows when they deviate from the predominantly rectangular shape that is common in Image Net. [Panels: (a) MPNet-filtered, predicted synsets include rubber_eraser.n.01, cradle.n.01, studio_couch.n.01; (b) CLIP-filtered, predicted synsets include studio_couch.n.01, knee_pad.n.01, velvet.n.01; (c) Image Net validation.]

Figure 29: Bolete. Image Net models are challenged when a bolete appears in contexts outside of nature, such as being picked by a girl or found in a pan. [Panels: (a) MPNet-filtered, predicted synsets include nematode.n.01, frying_pan.n.01, croquet_ball.n.01; (b) CLIP-filtered, predicted synsets include croquet_ball.n.01; (c) Image Net validation.]

Figure 30: Thimble. Image Net models are challenged when the thimble is among many other items. [Panels: (a) MPNet-filtered, predicted synsets include carpenter's_kit.n.01, letter_opener.n.01; (b) CLIP-filtered; (c) Image Net validation.]

Figure 31: Torch. Image Net models have difficulty recognizing graphical depictions of torches and identifying variations in torch orientation. [Panels: (a) MPNet-filtered, predicted synsets include can_opener.n.01, plunger.n.03, soap_dispenser.n.01; (b) CLIP-filtered, predicted synsets include goblet.n.01, ballpoint.n.01, pinwheel.n.02, whistle.n.04; (c) Image Net validation.]

Figure 32: Hen. Graphical hens pose a challenge for Image Net models. [Panels: (a) MPNet-filtered, predicted synsets include red-breasted_merganser.n.01, hamper.n.02, prairie_chicken.n.01; (b) CLIP-filtered, predicted synsets include candle.n.01, comic_book.n.01; (c) Image Net validation.]
MPNet-filtered images also include blue-winged and green-winged teal hens, which are not present in the Image Net dataset.

Figure 33: Grille. Image Net models only recognize a grille when it is installed on a car. LAIONet images also include various kinds of grilles that are not covered by the Image Net class. [Panels: (a) MPNet-filtered, predicted synsets include magnetic_compass.n.01, space_heater.n.01, honeycomb.n.02, abacus.n.02, fire_screen.n.01; (b) CLIP-filtered, predicted synsets include radiator.n.03, fire_screen.n.01, plate_rack.n.01, space_heater.n.01, window_screen.n.01, can_opener.n.01, moving_van.n.01; (c) Image Net validation.]

Figure 34: Brass. The intended concept of this class in Image Net is a memorial made of brass. However, LAIONet images correspond to the broader meaning, and the model is not expected to predict that. [Panels: (a) MPNet-filtered, predicted synsets include bolo_tie.n.01, table_lamp.n.01, soap_dispenser.n.01, water_tower.n.01, switch.n.01, prayer_rug.n.01; (b) CLIP-filtered, predicted synsets include lipstick.n.01, fountain.n.01, lampshade.n.01, prayer_rug.n.01, switch.n.01, breastplate.n.01, medicine_chest.n.01; (c) Image Net validation.]