Published as a conference paper at ICLR 2024

INTRIGUING PROPERTIES OF GENERATIVE CLASSIFIERS

Priyank Jaini, Kevin Clark, Robert Geirhos (equal contribution), Google DeepMind

ABSTRACT

What is the best paradigm to recognize objects: discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data. We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.

Figure 1: Zero-shot generative classifiers achieve a human-level shape bias: 99% for Imagen, 93% for Stable Diffusion, 92% for Parti, and 92–99% for individual human observers (96% on average). Most discriminative models are texture biased instead.

Figure 2: Classification with a diffusion generative classifier. Given a test image, such as a dog with clock texture (1), a text-to-image generative classifier adds random noise (2) and then reconstructs the image conditioned on the prompt "A bad photo of a <class>" for each class (3). The reconstructed image closest to the test image in L2 distance is taken as the classification decision (4); this estimates the diffusion variational lower bound (Clark & Jaini, 2023). For visualization purposes, icons corresponding to the prompt class are superimposed on the reconstructed images.

1 INTRODUCTION

Many discriminative classifiers perform well on data similar to the training distribution, but struggle on out-of-distribution images. For instance, a cow may be correctly recognized when photographed in a typical grassy landscape, but is not correctly identified when photographed on a beach (Beery et al., 2018). In contrast to many discriminatively trained models, generative text-to-image models appear to have acquired a detailed understanding of objects: they have no trouble generating cows on beaches or dog houses made of sushi (Saharia et al., 2022). This raises the question: If we could somehow get classification decisions out of a generative model, how well would it perform out-of-distribution? For instance, would it be biased towards textures like most discriminative models, or towards shapes like humans (Baker et al., 2018; Geirhos et al., 2019; Wichmann & Geirhos, 2023)?

We here investigate perceptual properties of generative classifiers, i.e., models trained to generate images from which we extract zero-shot classification decisions. We focus on two of the most successful types of text-to-image generative models (diffusion models and autoregressive models) and compare them to both discriminative models (e.g., ConvNets, vision transformers, CLIP) and human psychophysical data. Specifically, we focus on the task of visual object recognition (also known as classification) on challenging out-of-distribution datasets and visual illusions.
On a broader level, the question of whether perceptual processes such as object recognition are best implemented through a discriminative or a generative model has been discussed in various research communities for a long time. Discriminative inference is typically described as fast yet potentially prone to shortcut learning (Geirhos et al., 2020a), while generative modeling is often described as slow yet potentially more capable of robust inference (DiCarlo et al., 2021). The human brain appears to combine the best of both worlds, achieving fast inference but also robust generalization. How this is achieved, i.e., how discriminative and generative processes may be integrated, has been described as "the deep mystery in vision" (Kriegeskorte, 2015, p. 435) and has seen widespread interest in Cognitive Science and Neuroscience (see DiCarlo et al., 2021, for an overview). This mystery dates back to the idea of vision as inverse inference proposed more than 150 years ago by von Helmholtz (1867), who argued that the brain may need to infer the likely causes of sensory information, a process that requires a generative model of the world. In machine learning, this idea inspired approaches such as the namesake Helmholtz machine (Dayan et al., 1995), the concept of vision as Bayesian inference (Yuille & Kersten, 2006) and other analysis-by-synthesis methods (Revow et al., 1996; Bever & Poeppel, 2010; Schott et al., 2018; Zimmermann et al., 2021). However, when it comes to challenging real-world tasks like object recognition from photographs, the ideas of the past often lacked the methods (and compute power) of the future: until very recently, it was impossible to compare generative and discriminative models of object recognition simply because the only models capable of recognizing challenging images were standard discriminative models like deep convolutional networks (Krizhevsky et al., 2012; He et al., 2015) and vision transformers (Dosovitskiy et al., 2021). Excitingly, this is changing now and thus enables us to compare generative classifiers against both discriminative models and human object recognition data.

Figure 3: Out-of-distribution accuracy across 17 challenging datasets (Geirhos et al., 2021). Detailed results for all parametric datasets are plotted in Figure 5; Table 3 lists accuracies.

Concretely, in this work, we study the properties of generative classifiers based on three different text-to-image generative models: Stable Diffusion (SD), Imagen, and Parti, on 17 challenging OOD generalization datasets from the model-vs-human toolbox (Geirhos et al., 2021). We compare the performance of these generative classifiers with 52 discriminative models and human psychophysical data. Based on our experiments, we observe four intriguing properties of generative classifiers:

1. a human-like shape bias (Subsection 3.1),
2. near human-level out-of-distribution accuracy (Subsection 3.2),
3. state-of-the-art error consistency with humans (Subsection 3.3),
4. an understanding of certain perceptual illusions (Subsection 3.4).

2 METHOD: GENERATIVE MODELS AS ZERO-SHOT CLASSIFIERS

We begin with a dataset $\mathcal{D}_n := \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ of $n$ images, where each image belongs to one of $K$ classes $[y_K] := \{y_1, \dots, y_K\}$.
Our method classifies an image by predicting the most probable class assignment, assuming a uniform prior over classes:

$$\hat{y} = \arg\max_{y_k} \, p(y = y_k \mid x) = \arg\max_{y_k} \, p(x \mid y = y_k)\, p(y = y_k) = \arg\max_{y_k} \, \log p(x \mid y = y_k) \qquad (1)$$

A generative classifier (Ng & Jordan, 2001) uses a conditional generative model to estimate the likelihood $p_\theta(x \mid y = y_k)$, where $\theta$ are the model parameters.

Generative models: We study the properties of three different text-to-image generative models: Imagen (Saharia et al., 2022), a pixel-space diffusion model; Stable Diffusion (SD) (Rombach et al., 2022), a latent-space diffusion model; and Parti (Yu et al., 2022), a sequence-to-sequence autoregressive model. Since these models are conditioned on text prompts rather than class labels, we map each label $y_k$ to a text prompt using the template $y_k \mapsto$ "A bad photo of a $y_k$." to generate classification decisions. Conceptually, our approach to obtain classification decisions is visualized in Figure 2.

Figure 4: Error consistency across 17 challenging datasets (Geirhos et al., 2021). This metric measures whether errors made by models align with errors made by humans (higher is better).

Following Clark & Jaini (2023), we generate classification decisions from diffusion models like Stable Diffusion and Imagen by approximating the conditional log-likelihood $\log p_\theta(x \mid y = y_k)$ using the diffusion variational lower bound (see Appendix A for a background on diffusion models):

$$\hat{y} = \arg\max_{y_k} \, \log p_\theta(x \mid y = y_k) \approx \arg\min_{y_k} \, \mathbb{E}_{\epsilon, t}\left[ w_t \, \| x - \hat{x}_\theta(x_t, y_k, t) \|_2^2 \right] \qquad (2)$$

For SD, $x$ is a latent representation, whereas for Imagen $x$ consists of raw image pixels. Evaluating $p_\theta(x \mid y = y_k)$ for Parti amounts to performing one forward pass of the model, since it is an autoregressive model that provides an exact conditional likelihood. Thus, for each of these models we evaluate the conditional likelihood $p_\theta(x \mid y = y_k)$ for each class $y_k \in [y_K]$ and assign the class with the highest likelihood obtained via Equation (1).

Model-vs-human datasets: We study the performance of these generative classifiers on 17 challenging out-of-distribution (OOD) datasets proposed in the model-vs-human toolbox (Geirhos et al., 2021). Of these 17 datasets, five correspond to a non-parametric single manipulation (sketches, edge-filtered images, silhouettes, images with a texture-shape cue conflict, and stylized images where the original image texture is replaced by the style of a painting). The other twelve datasets consist of parametric image distortions like low-pass filtered images, additive uniform noise, etc. These datasets are designed to test OOD generalization for diverse models in comparison to human object recognition performance. The human data consists of 90 human observers with a total of 85,120 trials collected in a dedicated psychophysical laboratory on a carefully calibrated screen (see Geirhos et al., 2021, for details). This allows us to compare classification data for zero-shot generative models, discriminative models and human observers in a comprehensive, unified setting.

Preprocessing: We preprocess the 17 datasets in the model-vs-human toolbox by resizing the images to 64×64 resolution for Imagen, 256×256 for Parti, and 512×512 for SD, since these are the input resolutions of the respective base models. We use the prompt "A bad photo of a $y_k$." for each dataset and every model.
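To make the scoring procedure in Equations (1) and (2) concrete, the following minimal Python sketch shows how per-class scores could be aggregated for a single image. It is our illustration rather than the released implementation: `denoise` stands in for the text-conditional prediction $\hat{x}_\theta(x_t, y_k, t)$, `alpha_bar` stands in for the model's cumulative noise schedule, and the closed-form corruption step is the standard DDPM parameterization rather than anything specific to Imagen or SD; the prompt template and the heuristic weighting $w_t = \exp(-7t)$ follow the settings described in this paper.

```python
import numpy as np


def classify_with_diffusion(x, class_names, denoise, alpha_bar, n_samples=64, seed=0):
    """Sketch of Eq. (1)-(2): pick the class whose prompt gives the smallest
    weighted denoising error. `denoise(x_t, prompt, t)` and `alpha_bar(t)` are
    stand-ins for the conditional model and its noise schedule."""
    rng = np.random.default_rng(seed)
    prompts = [f"A bad photo of a {c}." for c in class_names]
    scores = {p: [] for p in prompts}
    for _ in range(n_samples):
        t = rng.uniform()                                  # t ~ U([0, 1])
        a = alpha_bar(t)                                   # remaining signal fraction
        x_t = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)
        w_t = np.exp(-7.0 * t)                             # heuristic weighting from Eq. (2)
        for p in prompts:
            err = np.sum((x - denoise(x_t, p, t)) ** 2)    # ||x - x_hat||_2^2
            scores[p].append(w_t * err)
    # With a uniform prior, arg max log p(x|y) corresponds to arg min weighted error.
    best = min(prompts, key=lambda p: np.mean(scores[p]))
    return class_names[prompts.index(best)]


# Toy usage with a dummy "denoiser" that simply returns its noisy input:
if __name__ == "__main__":
    dummy = lambda x_t, prompt, t: x_t
    img = np.zeros((8, 8, 3))
    print(classify_with_diffusion(img, ["dog", "clock"], dummy, alpha_bar=lambda t: 1.0 - t))
```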
Although Imagen (Saharia et al., 2022) is a cascaded diffusion model consisting of a 64×64 low-resolution model and two super-resolution models, we only use the 64×64 base model for our experiments here. We use SD v1.4 (Rombach et al., 2022) for our experiments, which uses a pre-trained text encoder from CLIP to encode text and a pre-trained VAE to map images to a latent space. Finally, we use the Parti-3B model (Yu et al., 2022), consisting of an image tokenizer and an encoder-decoder transformer model that casts text-to-image generation as a sequence-to-sequence modeling problem.

Figure 5: Detailed out-of-distribution accuracy for Imagen, Stable Diffusion and Parti in comparison to human observers across the twelve parametric datasets: (a) colour/greyscale, (b) true/false colour, (c) power equalised, (d) rotation, (e) uniform noise, (f) contrast, (g) low-pass, (h) high-pass, (i) phase noise, (j) Eidolon I, (k) Eidolon II, (l) Eidolon III. While not always aligning perfectly with human accuracy, the overall robustness achieved by Imagen and Stable Diffusion is comparable to that of human observers, even though these models are zero-shot, i.e., neither designed nor trained to do classification.

Baseline models for comparison: As baseline discriminative classifiers, we compare Imagen, SD, and Parti against 52 diverse models from the model-vs-human toolbox (Geirhos et al., 2021) that are either trained or fine-tuned on ImageNet, three ViT-22B variants (Dehghani et al., 2023) (very large 22B-parameter vision transformers), and CLIP (Radford et al., 2021) as a zero-shot classifier baseline. The CLIP model is based on the largest version, ViT-L/14@224px, and consists of vision and text transformers trained with contrastive learning. We use the CLIP model that uses an ensemble of 80 different prompts for classification (Radford et al., 2021). We plot all baseline discriminative models in grey and human subject data in red.

Metrics: We compare all the models over the 17 OOD datasets based on three metrics: (a) shape bias, (b) OOD accuracy, and (c) error consistency. Shape bias is defined by Geirhos et al. (2019) as the fraction of decisions that are identical to the shape label of an image divided by the fraction of decisions for which the model output was identical to either the shape or the texture label, on a dataset with texture-shape cue conflict. OOD accuracy is defined as the fraction of correct decisions for a dataset that is not from the training distribution. Error consistency (see Geirhos et al., 2020b, for details) is measured in Cohen's kappa (Cohen, 1960) and indicates whether two decision makers (e.g., a model and a human observer) systematically make errors on the same images. If that is the case, it may be an indication of deeper underlying similarities in terms of how they process images and recognize objects. Error consistency between models $f_1$ and $f_2$ is defined over a dataset on which both models are evaluated on exactly the same images and output a label prediction; the metric indicates the fraction of images on which $\mathbb{1}[f_1(x) = y_x]$ is identical to $\mathbb{1}[f_2(x) = y_x]$ (i.e., both models are either correct or wrong on the same image), corrected for chance agreement. This ensures that an error consistency value of 0 corresponds to chance agreement, positive values indicate beyond-chance agreement (up to 1.0) and negative values indicate systematic disagreement (down to -1.0).
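For reference, these metrics reduce to simple computations over per-trial decisions. The sketch below is our own illustration, not the model-vs-human toolbox implementation; OOD accuracy is just the mean of the correctness indicators, so only shape bias and error consistency are shown, and the function names are ours.

```python
import numpy as np


def shape_bias(decisions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions matching the shape label, among all
    decisions matching either the shape or the texture label (Geirhos et al., 2019)."""
    decisions = np.asarray(decisions)
    shape_hits = decisions == np.asarray(shape_labels)
    texture_hits = decisions == np.asarray(texture_labels)
    return shape_hits.sum() / (shape_hits.sum() + texture_hits.sum())


def error_consistency(correct_1, correct_2):
    """Cohen's kappa on the binary correct/incorrect indicators of two decision
    makers evaluated on the same images (Geirhos et al., 2020b)."""
    c1, c2 = np.asarray(correct_1, bool), np.asarray(correct_2, bool)
    observed = np.mean(c1 == c2)                  # fraction of shared hits and errors
    p1, p2 = c1.mean(), c2.mean()                 # individual accuracies
    expected = p1 * p2 + (1 - p1) * (1 - p2)      # agreement expected by chance alone
    return (observed - expected) / (1 - expected)


# Toy check: two observers at 75% accuracy who err on exactly the same trials.
hits_a = np.array([1, 1, 1, 0] * 10, bool)
hits_b = np.array([1, 1, 1, 0] * 10, bool)
print(error_consistency(hits_a, hits_b))          # 1.0: identical error patterns
```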
Table 1: Benchmark results for model-vs-human metrics (Geirhos et al., 2021). Within each model type (zero-shot vs. discriminative), the best result for each category is shown in bold.

| model | model type | shape bias | OOD accuracy | error consist. |
|---|---|---|---|---|
| Imagen (1 prompt) | zero-shot | **99%** | **0.71** | **0.31** |
| Stable Diffusion (1 prompt) | zero-shot | 93% | 0.69 | 0.26 |
| Parti (1 prompt) | zero-shot | 92% | 0.58 | 0.23 |
| CLIP (1 prompt) | zero-shot | 80% | 0.55 | 0.26 |
| CLIP (80 prompts) | zero-shot | 57% | **0.71** | 0.28 |
| ViT-22B-384 trained on 4B images | discriminative | **87%** | **0.80** | **0.26** |
| ViT-L trained on IN-21K | discriminative | 42% | 0.73 | 0.21 |
| RN-50 trained on IN-1K | discriminative | 21% | 0.56 | 0.21 |
| RN-50 trained w/ diffusion noise | discriminative | 57% | 0.57 | 0.24 |
| RN-50 train+eval w/ diffusion noise | discriminative | 78% | 0.43 | 0.18 |

3 RESULTS: FOUR INTRIGUING PROPERTIES OF GENERATIVE CLASSIFIERS

3.1 HUMAN-LIKE SHAPE BIAS

Introduced by Geirhos et al. (2019), the shape bias of a model indicates to what degree the model's decisions are based on object shape, as opposed to object texture. We study this phenomenon using the cue-conflict dataset, which consists of images with a shape-texture cue conflict. As shown in Geirhos et al. (2021), most discriminative models are biased towards texture, whereas humans are biased towards shape (96% shape bias on average; 92% to 99% for individual observers). Interestingly, we find that all three zero-shot generative classifiers show a human-level shape bias: Imagen achieves a stunning 99% shape bias, Stable Diffusion 93% and Parti a 92% shape bias. As we show in Figure 1, Imagen closely matches or even exceeds human shape bias across nearly all categories, achieving a previously unseen shape bias of 99%. In Table 1, we report that all three generative classifiers significantly outperform ViT-22B (Dehghani et al., 2023), the previous state-of-the-art method in terms of shape bias, even though all three models are smaller in size, trained on less data, and, unlike ViT-22B, were not designed for classification.

3.2 NEAR HUMAN-LEVEL OOD ACCURACY

Humans excel at recognizing objects even if they are heavily distorted. Do generative classifiers also possess similar out-of-distribution robustness? We find that Imagen and Stable Diffusion achieve an overall accuracy that is close to human-level robustness (cf. Figure 3) despite being zero-shot models; these generative models are outperformed by some very competitive discriminative models like ViT-22B, which achieve super-human accuracies. The detailed plots in Figure 5 show that on most datasets (except rotation and high-pass), the performance of all three generative classifiers approximately matches human responses. Additional results are in Table 3 and Figures 11 and 12 in the appendix. Notably, all three models are considerably worse than humans in recognizing rotated images.
Curiously, these models also struggle to generate rotated images when prompted with text such as "A rotated image of a dog." or "An upside down image of a dog." This highlights an exciting possibility: evaluating generative models on downstream tasks like OOD datasets may be a quantitative way of gaining insights into the generation capabilities and limitations of these models. On high-pass filtered images, Imagen performs much worse than humans, whereas SD and Parti exhibit more robust performance. The difference in performance between Imagen and SD may be attributed to the weighting function used in Equation (2). Our choice of weighting function, $w_t := \exp(-7t)$, as used in Clark & Jaini (2023), tends to give higher weight to the lower noise levels and is thus poorly suited to extracting decisions for high-frequency images. SD, on the other hand, operates in the latent space, and thus the weighting function in Equation (2) affects its decisions differently than for Imagen. Nevertheless, this indicates that even though Imagen and SD are both diffusion-based models, they exhibit very different sensitivities to high spatial frequencies. Despite those two datasets where generative classifiers show varied performance, they overall achieve impressive zero-shot classification accuracy (near human-level performance as shown in Figure 3).

Figure 6: Model-to-model error consistency for cue conflict images. The matrix is sorted according to mean error consistency with humans (higher consistency from left to right / top to bottom).

3.3 SOTA ERROR CONSISTENCY WITH HUMAN OBSERVERS

Humans and models may both achieve, say, 90% accuracy on a dataset, but do they make errors on the same 10% of images, or on different images? This is measured by error consistency (Geirhos et al., 2020b). In Figure 4, we show the overall results for all models across the 17 datasets. While a substantial gap towards human-to-human error consistency remains, Imagen shows the most human-aligned error patterns, surpassing the previous state of the art (SOTA) set by ViT-22B, a large vision transformer (Dehghani et al., 2023). SD also exhibits comparatively human-aligned error consistency but trails significantly behind Imagen. These findings appear consistent with the MNIST results by Golan et al. (2020) reporting that a generative model captures human responses better than discriminative models. Additionally, a matrix plot of error consistency of all the models on cue-conflict images is shown in Figure 6. Interestingly, the plot shows a clear dichotomy between discriminative models, which exhibit error patterns similar to each other, and generative models, whose error patterns more closely match humans and which thus end up in the human cluster. While overall a substantial gap between the best models and human-to-human consistency remains (Figure 4), Imagen best captures human classification errors despite never being trained for classification. We report more detailed results in the appendix in Table 2 and Figures 11-18.

3.4 UNDERSTANDING CERTAIN VISUAL ILLUSIONS

Beyond quantitative benchmarking, we investigated a more qualitative aspect of generative models: whether they can understand certain visual illusions. In human perception, illusions often reveal aspects of our perceptual abilities that would otherwise go unnoticed. We therefore tested generative models on images that are visual illusions for humans.
In contrast to discriminative models, generative classifiers offer a straightforward way to test illusions: for bistable images such as the famous rabbit-duck, we can prompt them to reconstruct based on "an image of a duck" and "an image of a rabbit". If they can (a) reconstruct images resembling the respective animal and (b) place the reconstructed animal in the same location and pose as humans would, this can be seen as evidence that they understand the illusion. We find that this is indeed the case for generative models like Imagen, Stable Diffusion, and Muse (Chang et al., 2023). Since Parti cannot directly be used for image editing, we used Muse, a masking-based generative model that operates on VQ-VAE tokens similar to Parti, for this analysis instead of Parti; Muse, on the other hand, cannot be used as a classifier, as explained in Appendix B. This ensures that our conclusions are not limited to diffusion models but cover generative models more broadly. In Figure 7, we use four different images that are either bistable illusions for humans or images for which humans exhibit pareidolia (a tendency to see patterns in things, like a face in a rock). In all cases, the text-to-image generative models are able to recognize the illusion and recreate correct images conditioned on the respective text prompts. This indicates that these generative models share certain bistable illusions and pareidolia with human visual perception.

Figure 7: Generative classifiers understand certain visual illusions, as indicated by their ability to reconstruct ambiguous images in a way that aligns with how humans perceive those images. For instance, they reconstruct a right-facing rabbit vs. a left-facing duck in the case of the bistable rabbit-duck illusion, and place the face in the right location and pose for an image where humans show pareidolia (seeing patterns in things, like a face in a rock). Image attribution: Appendix D.

4 ANALYSIS: WHERE DOES THE INCREASED SHAPE BIAS ORIGINATE FROM?

In Section 3, we highlighted four intriguing properties of generative classifiers. The most striking emergent property amongst the four is the human-level shape bias demonstrated by these generative classifiers, a bias that no discriminative model so far was able to show. A natural question to ask is thus: What aspect of these generative models causes such an increase in shape bias? We observed that for diffusion models like Imagen and Stable Diffusion, the recreated images used for classification were usually devoid of texture cues (for an example see Figure 2). We posit that the denoising process used for classification (cf. Equation (2)) might bias the diffusion model towards capturing low-frequency information and thereby towards the global structure of the image, as captured by the shape of an object. Indeed, in Figure 5, we observe that while generative classifiers are well within the range of other models for most datasets, they demonstrate very distinctive results on low-pass filtered (i.e., blurred) images; Imagen, the most shape-biased model, is on par with humans. Conversely, Imagen struggles to classify high-pass images. Could it be the case that these generative models put more emphasis on lower spatial frequencies, whereas most textures are high-frequency in nature? If this is indeed the case, then performance on blurred images and shape bias should have a significant positive correlation.
We tested this hypothesis empirically and indeed found a strong positive and highly significant correlation between the two (Pearson's r(58) = .59, p < 0.001; Spearman's r(58) = .64, p < 0.001), a finding consistent with Subramanian et al. (2023), who observe a significant correlation between shape bias and spatial-frequency channel bandwidth. While correlations establish a connection between the two, they are of course not evidence for a causal link. We hypothesized that the noise applied during diffusion training might encourage models to ignore high-frequency textures and focus on shapes. To test this prediction, we trained a standard ResNet-50 on ImageNet-1K (Russakovsky et al., 2015) by adding diffusion-style noise as a data augmentation during both training and evaluation. Interestingly, training with data augmented with diffusion-style noise increases the shape bias from 21% for a standard ResNet-50 to 78%, as shown in Figures 1 and 14 and Table 1. This simple trick achieves a substantially higher shape bias than the 62% observed by prior work when combining six different techniques and augmentations (Hermann et al., 2020). This result shows that (i) diffusion-style training biases models to emphasize low spatial frequency information and (ii) models that emphasize lower spatial frequencies exhibit an increased shape bias. Other factors such as generative training, the quality and quantity of data, and the use of a powerful language model might also play a role. However, given the magnitude of the observed change in shape bias, this indicates that diffusion-style training is indeed a crucial factor.
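For completeness, the correlation test reported at the beginning of this section amounts to the following computation over per-model scores. The arrays below are hypothetical placeholders for illustration only; the actual analysis uses the shape-bias values and low-pass (blur) accuracies of the 60 benchmarked models, hence the 58 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical per-model scores (placeholders, not the benchmark values).
shape_bias = np.array([0.21, 0.42, 0.57, 0.78, 0.80, 0.92, 0.93, 0.99])
blur_accuracy = np.array([0.30, 0.35, 0.41, 0.52, 0.50, 0.60, 0.63, 0.68])

r, p = stats.pearsonr(shape_bias, blur_accuracy)
rho, p_rank = stats.spearmanr(shape_bias, blur_accuracy)
print(f"Pearson r = {r:.2f} (p = {p:.3g}), Spearman rho = {rho:.2f} (p = {p_rank:.3g})")
```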
5 DISCUSSION

Motivation. While generative pre-training has been prevalent in natural language processing, in computer vision it is still common to pre-train models on labeled datasets such as ImageNet (Deng et al., 2009) or JFT (Sun et al., 2017). At the same time, generative text-to-image models like Stable Diffusion, Imagen, and Parti show powerful abilities to generate photo-realistic images from diverse text prompts. This suggests that these models learn useful representations of the visual world, but so far it has been unclear how their representations compare to discriminative models. Furthermore, discriminative models have similarly dominated computational modeling of human visual perception, even though the use of generative models by human brains has long been hypothesized and discussed. In this work, we performed an empirical investigation on out-of-distribution datasets to assess whether discriminative or generative models better fit human object recognition data.

Key results. We report four intriguing human-like properties of generative models: (1) Generative classifiers are the first models that achieve a human-like shape bias (92–99%); (2) they achieve near human-level OOD accuracy despite being zero-shot classifiers that were neither trained nor designed for classification; (3) one of them (Imagen) shows the most human-aligned error patterns that machine learning models have achieved to date; and (4) all investigated models qualitatively capture the ambiguities of images that are perceptual illusions for humans.

Implications for human perception. Our results establish generative classifiers as one of the leading behavioral models of human object recognition. While we certainly don't resolve the "deep mystery" of vision (Kriegeskorte, 2015, p. 435) in terms of how brains might combine generative and discriminative models, our work paves the way for future studies that might combine the two. Quoting Luo (2022, p. 22) on diffusion: "It is unlikely that this is how we, as humans, naturally model and generate data; we do not seem to generate novel samples as random noise that we iteratively denoise." We fully agree, but diffusion may just be one of many implementational ways to arrive at a representation that allows for powerful generative modeling. Human brains are likely to use a different implementation, but they still may (or may not) end up with a similar representation.

Implications for machine perception. We provide evidence for the benefits of generative pre-training, particularly in terms of zero-shot performance on challenging out-of-distribution tasks. In line with recent work on using generative models for depth estimation (Zhao et al., 2023) or segmentation (Burgert et al., 2022; Brempong et al., 2022; Li et al., 2023), this makes the case for generative pre-training as a compelling alternative to contrastive or discriminative training for vision tasks. Additionally, our experiments provide a framework for finding potential bugs of generative models through classification tasks. For example, all the generative models performed poorly on the rotation dataset; those models also struggled to generate rotated or upside-down images of objects. Similar experiments could be used to evaluate generative models for undesirable behaviour, toxicity and bias.

Limitations. A limitation of the approach we used in this paper is its computational speed (as we also alluded to in Section 1): the approach does not yield a practical classifier. Secondly, all three models have different model sizes and input resolutions, and are trained on different datasets for different amounts of time, so the comparison is not perfect. Furthermore, different models use different language encoders, which may be a confounding factor. By including diverse generative models, our comparisons aim to highlight the strengths and weaknesses of generative models. We explore a number of ablations and additional analyses in the appendix, including shape bias results for image captioning models that are of generative nature but not trained for text-to-image modeling.

Future directions. Beyond the questions regarding how biological brains might combine generative and discriminative models, we believe it will be interesting to study how, and to what degree, language cross-attention influences the intriguing properties we find. Moreover, is denoising diffusion training a crucial component that explains the impressive performance of Imagen and SD? We hope our findings show how intriguing generative classifiers are for exploring exciting future directions.

ACKNOWLEDGMENTS

We would like to express our gratitude to the following colleagues (in alphabetical order) for helpful discussions and feedback: David Fleet, Katherine Hermann, Been Kim, Alex Ku, Jon Shlens, and Kevin Swersky. Furthermore, we would like to thank Michael Tschannen and Manoj Kumar for suggesting the CapPa model analysis and providing the corresponding models.

REFERENCES

Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J Kellman. Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12):e1006613, 2018.

Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita.
In Proceedings of the European Conference on Computer Vision, pp. 456–473, 2018.

Thomas G Bever and David Poeppel. Analysis by synthesis: a (re-)emerging program of research for language and vision. Biolinguistics, 4(2-3):174–200, 2010.

Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen, Niki Parmar, Matthias Minderer, and Mohammad Norouzi. Denoising Pretraining for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4175–4186, 2022.

Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors. arXiv preprint arXiv:2211.13224, 2022.

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-To-Image Generation via Masked Generative Transformers. arXiv preprint arXiv:2301.00704, 2023.

Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023.

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp. 7480–7512. PMLR, 2023.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

James J DiCarlo, Ralf Haefner, Leyla Isik, Talia Konkle, Nikolaus Kriegeskorte, Benjamin Peters, Nicole Rust, Kim Stachenfeld, Joshua B Tenenbaum, Doris Tsao, et al. How does the brain combine generative models and direct discriminative computations in high-level vision? 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence, 2:665–673, 2020a.

Robert Geirhos, Kristof Meding, and Felix A Wichmann. Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. Advances in Neural Information Processing Systems, 33, 2020b.

Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Partial success in closing the gap between human and machine vision. Advances in Neural Information Processing Systems, 34:23885–23899, 2021.
Tal Golan, Prashant C Raju, and Nikolaus Kriegeskorte. Controversial stimuli: Pitting neural networks against each other as models of human cognition. Proceedings of the National Academy of Sciences, 117(47):29330–29337, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems, 33:19000–19015, 2020.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.

Diederik P Kingma and Max Welling. Auto-encoding Variational Bayes. International Conference on Learning Representations, 2014.

Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1:417–446, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, and Sanja Fidler. DreamTeacher: Pretraining Image Backbones with Deep Generative Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16698–16708, 2023.

Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.

Andrew Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 2001.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Michael Revow, Christopher KI Williams, and Geoffrey E Hinton. Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):592–606, 1996.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Advances in Neural Information Processing Systems, 2022.

Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel.
Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2018.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Ajay Subramanian, Elena Sizikova, Najib J Majaj, and Denis G Pelli. Spatial-frequency channels, shape bias, and adversarial robustness. In Advances in Neural Information Processing Systems, 2023.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852, 2017.

Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image Captioners Are Scalable Vision Learners Too. In Advances in Neural Information Processing Systems, 2023.

Hermann von Helmholtz. Handbuch der physiologischen Optik: mit 213 in den Text eingedruckten Holzschnitten und 11 Tafeln, volume 9. Voss, 1867.

Felix A Wichmann and Robert Geirhos. Are Deep Neural Networks Adequate Behavioral Models of Human Visual Perception? Annual Review of Vision Science, 9, 2023.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.

Alan Yuille and Daniel Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.

Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing Text-to-Image Diffusion Models for Visual Perception. arXiv preprint arXiv:2303.02153, 2023.

Roland S Zimmermann, Lukas Schott, Yang Song, Benjamin A Dunn, and David A Klindt. Score-based generative classifiers. arXiv preprint arXiv:2110.00473, 2021.

Table of Contents

A Background on diffusion models
B Muse as a classifier
C Limitations
D Image attribution
E Details on ResNet-50 training with diffusion noise
F Shape bias of image captioning models
G Additional plots for model-vs-human benchmark
H Quantitative benchmark scores and rankings

A BACKGROUND ON DIFFUSION MODELS

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Song & Ermon, 2020) are latent variable generative models defined by a forward and a reverse Markov chain. Given an unknown data distribution $q(x_0)$ over observations $x_0 \in \mathbb{R}^d$, the forward process corrupts the data into a sequence of noisy latent variables $x_{1:T} := \{x_1, x_2, \dots, x_T\}$ by gradually adding Gaussian noise with a fixed schedule, defined as:

$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad (3)$$

where $q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$.
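As a small numerical illustration of the forward chain in Equation (3) (our addition, using an illustrative linear $\beta_t$ schedule rather than the schedule of any model discussed here), repeatedly applying $q(x_t \mid x_{t-1})$ drives a sample towards a standard normal, which is why the reverse process can start from $\mathcal{N}(x_T; 0, I)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # illustrative linear schedule (assumption)

x = rng.standard_normal(4096) * 0.1 + 0.5   # a toy "image" x_0 with mean 0.5
for beta in betas:                           # q(x_t | x_{t-1}) = N(sqrt(1-beta) x_{t-1}, beta I)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(x.mean(), x.std())                     # roughly 0 and 1: x_T is close to N(0, I)
```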
The reverse Markov process gradually denoises the latent variables back to the data distribution with learned Gaussian transitions, starting from $\mathcal{N}(x_T; 0, I)$, i.e.,

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right).$$

The aim of training is for the distribution of the forward process $\{x_t\}_{t=0}^{T}$ to match that of the reverse process, i.e., for the generative model $p_\theta(x_0)$ to closely match the data distribution $q(x_0)$. Specifically, these models can be trained by optimizing the variational lower bound of the marginal likelihood (Ho et al., 2020; Kingma et al., 2021):

$$\log p_\theta(x_0) \geq \mathrm{VLB}(x_0) := \mathcal{L}_{\mathrm{Prior}} + \mathcal{L}_{\mathrm{Recon}} + \mathcal{L}_{\mathrm{Diffusion}}$$

$\mathcal{L}_{\mathrm{Prior}}$ and $\mathcal{L}_{\mathrm{Recon}}$ are the prior and reconstruction losses, which can be estimated using standard techniques in the literature (Kingma & Welling, 2014). The (re-weighted) diffusion loss can be written as:

$$\mathcal{L}_{\mathrm{Diffusion}} = \mathbb{E}_{x_0, \varepsilon, t}\left[ w_t \, \| x_0 - \hat{x}_\theta(x_t, t) \|_2^2 \right]$$

with $x_0 \sim q(x_0)$, $\varepsilon \sim \mathcal{N}(0, I)$, and $t \sim U([0, T])$. Here, $w_t$ is a weight assigned to the timestep, and $\hat{x}_\theta(x_t, t)$ is the model's prediction of the observation $x_0$ from the noised observation $x_t$. Diffusion models can be conditioned on additional inputs like class labels, text prompts, segmentation masks or low-resolution images, in which case $\hat{x}_\theta$ also takes a conditioning signal $y$ as input.

Design choices for zero-shot classification using diffusion models: We follow the exact experimental setting of Clark & Jaini (2023) for Imagen and Stable Diffusion to obtain classification decisions. Specifically, we use the heuristic weighting function $w_t := \exp(-7t)$ in Equation (2) to aggregate scores across multiple time steps. We use a single prompt for each image instead of an ensemble of prompts as used in CLIP, to keep the experiments simple.

Loss function: We use the L2 loss function for diffusion-based models since it approximates the diffusion variational lower bound (see Equation (2)) and thus results in a Bayesian classifier. Furthermore, both Stable Diffusion and Imagen are trained with the L2 loss objective. A different loss function would therefore no longer result in a Bayesian classifier and would not work well due to the mismatch with the training paradigm.

Algorithm 1: Classification using diffusion models.
Given: example to classify $x$, diffusion model with parameters $\theta$, weighting function $w$, and a prompt embedding $\phi(y_i)$ for each class.
1. Initialize a map from classes to diffusion-model scores: scores $= \{y_i : [\,] \text{ for } y_i \in [y_K]\}$; set $n = 0$.
2. While $n <$ max_scores: set $n = n + 1$; noise the image by sampling $t \sim U([0, 1])$ and $x_t \sim q(x_t \mid x)$; then, for each remaining class $y_i$ in scores, add $w_t \, \| x - \hat{x}_\theta(x_t, \phi(y_i), t) \|_2^2$ to scores[$y_i$].
3. Return $\hat{y}$, the class with the lowest mean score.

B MUSE AS A CLASSIFIER

In Figure 8, we visualize why we were not able to include Muse as a (successful) classifier in our experiments. Even on clean, undistorted images, Muse achieved only approximately chance-level accuracy. This may, however, just be a limitation of how we attempted to extract classification decisions out of the model; it is entirely possible that other approaches might work better.

Figure 8: Muse as a classifier. This figure illustrates why Muse reconstructions are often challenging to use for classification. Given a starting image of a bottle with bear texture, we here plot 16 reconstructions, each prompted with a different label. The L2 distance to the original image is shown in red for all categories except the one with the lowest distance (here: top left), for which the distance is plotted in green. The images with the lowest distance (row 1, column 1: airplane; row 2, column 3: car; row 4, column 4: truck) appear to be categories for which the model is unable to generate a realistic reconstruction, and thus simply sticks close to the original image. The image generated by prompting with the correct shape category (bottle) is shown in row 2, column 2. Since Muse returns images fairly close to the original one if it is not able to generate a realistic reconstruction, the approach of measuring the L2 distance between original and reconstruction is not a feasible classification approach for Muse.

C LIMITATIONS

As mentioned in the introduction, using a generative model comes with advantages and disadvantages: potentially better generalization currently comes at the cost of being computationally slower than standard discriminative models. While this doesn't matter much for the purpose of analyses, it is a big drawback in practical applications, and any approaches that improve speed would be most welcome; in particular, generating at least one prediction (i.e., image generation) per class as we currently do is both expensive and slow. Furthermore, the models we investigate all differ from one another in more than one way.
For instance, their training data, architecture, and training procedure are not identical; thus, any differences between the models cannot currently be attributed to a single factor. That said, by including a set of diverse models covering pixel-space diffusion, latent-space diffusion, and autoregressive models, we seek to cover a variety of generative classifiers in order to ensure that the conclusions we draw are not limited to a narrow set of generative models.

D IMAGE ATTRIBUTION

Rabbit-duck image: Unknown source, Public domain, via Wikimedia Commons. Link: https://upload.wikimedia.org/wikipedia/commons/9/96/Duck-Rabbit.png

Rock image: Mirabeau, CC BY-SA 3.0, via Wikimedia Commons. Link: https://commons.wikimedia.org/wiki/File:Visage_dans_un_rocher.jpg

Vegetable portrait image: Giuseppe Arcimboldo, Public domain, via Wikimedia Commons. Link: https://upload.wikimedia.org/wikipedia/commons/4/49/Arcimboldo_Vegetables.jpg

Woman image: W. E. Hill, Public domain, via Wikimedia Commons. Link: https://upload.wikimedia.org/wikipedia/commons/5/5f/My_Wife_and_My_Mother-In-Law_%28Hill%29.svg

E DETAILS ON RESNET-50 TRAINING WITH DIFFUSION NOISE

We trained a ResNet-50 in exactly the same way as for standard 90-epoch JAX ImageNet training, with the key difference that we added diffusion noise as described by the code below. Since this makes the training task substantially more challenging, we trained the model for 300 instead of 90 epochs. The learning rate was 0.1 with a cosine learning rate schedule, 5 warmup epochs, SGD momentum of 0.9, weight decay of 0.0001, and a per-device batch size of 64. For the diffusion-style noise we used a flag named sqrt_alphas, which ensures that the applied noise doesn't completely destroy the image information in most cases. The input to the AddNoise transform is in the [0, 1] range; its output exceeds this range due to the added noise; we did not normalize or clip it afterwards but instead fed it directly into the network. We did not perform ImageNet mean/std normalization. The training augmentations we used were (1) random resized crop, (2) random horizontal flip, and (3) adding diffusion noise. We did not optimize any of those settings with respect to any of the observed findings (e.g., shape bias) since we were interested in generally applicable results.

Listing 1: Python example of how diffusion noise was added as a data augmentation technique.

```python
# Copyright 2023 DeepMind Technologies Limited.
# SPDX-License-Identifier: Apache-2.0
import dataclasses
from typing import Any, Dict

import grain.tensorflow as tfgrain
import tensorflow as tf

# Flat feature dictionary (string keys to tensors); stands in for grain's
# FlatFeatures alias used in the original training code.
FlatFeatures = Dict[str, Any]


@dataclasses.dataclass(frozen=True)
class AddNoise(tfgrain.MapTransform):
  """Adds diffusion-style noise to the image."""

  sqrt_alphas: bool = True

  def map(self, features: FlatFeatures) -> FlatFeatures:
    image = features["image"]
    # Sample a random signal level; taking the square root biases it towards
    # higher signal (lower noise) levels so most images stay recognizable.
    alpha = tf.random.uniform([])
    if self.sqrt_alphas:
      alpha = tf.sqrt(alpha)
    std = tf.sqrt(1 - alpha * alpha)
    # Mix the image with Gaussian noise: x_noisy = alpha * x + std * eps.
    image = image * alpha + std * tf.random.normal(image.shape)
    features["image"] = image
    features["noise_level"] = std
    return features
```
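As a quick sanity check (our addition, not part of the original training code), the transform above can be applied to a single dummy feature dictionary outside of any input pipeline; values outside [0, 1] are expected, as noted above.

```python
import tensorflow as tf

# Apply the transform once to a dummy example (assumes Listing 1 has been imported).
features = {"image": tf.random.uniform([224, 224, 3])}   # values in [0, 1)
out = AddNoise(sqrt_alphas=True).map(dict(features))
print(float(out["noise_level"]),                          # sampled noise std
      float(tf.reduce_min(out["image"])),                 # typically below 0
      float(tf.reduce_max(out["image"])))                 # typically above 1
```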
F SHAPE BIAS OF IMAGE CAPTIONING MODELS

We discovered that text-to-image generative models, when turned into classifiers, have a high shape bias. We were interested in understanding whether other forms of generative modeling (beyond text-to-image) could lead to similar results. As an initial step towards exploring this question, we tested the CapPa models from Tschannen et al. (2023); those models are of generative nature too, but instead of text-to-image modeling they are trained to produce image captions. The results are shown in Figure 9. Both CapPa variants (B16 and L16) are more shape biased than most discriminative models, but at the same time both are less shape biased than the text-to-image generative classifiers we tested. This could indicate that generative modeling is helpful when it comes to shape bias, while the objective function and data determine how strong the shape bias is.

Method details for CapPa evaluation. We obtained the original B16 and L16 checkpoints from the CapPa paper. We used the identical prompt as for the generative classifiers, "A bad photo of a <class>", where <class> is substituted for, e.g., "car" (and using "an" for classes starting with a vowel). We only used this single prompt since initial experiments showed that CapPa ImageNet validation accuracy decreases when using the 80 CLIP ImageNet template prompts (compared to just using a single prompt). This could be attributed to the fact that the CapPa models were not designed to handle different prompts well.

Figure 9: Shape bias of CapPa models (fraction of 'shape' vs. 'texture' decisions per shape category). This plot is identical to Figure 1 but contains two additional datapoints: the CapPa models, which are trained to produce text captions for images. The grey stars correspond to CapPa B16; the black stars correspond to CapPa L16. Both models are trained on 9B image-text pairs.
G ADDITIONAL PLOTS FOR MODEL-VS-HUMAN BENCHMARK

We here plot detailed performance for all models with respect to a few different properties of interest / metrics: aggregated performance across 17 datasets (Figure 10); out-of-distribution accuracy (Figure 11 for parametric datasets and Figure 12 for nonparametric datasets); model-to-human error consistency (Figure 11); human-to-human and model-to-model error consistency for nonparametric datasets (Figures 6 and 15 to 18); shape bias (Figures 13 and 14). All metrics are based on the model-vs-human toolbox and are explained in more detail in Geirhos et al. (2021).
Figure 10: Benchmark results for different models, aggregated over datasets. Panels: (a) OOD accuracy (higher = better); (b) accuracy difference (lower = better); (c) observed consistency (higher = better); (d) error consistency (higher = better).

H QUANTITATIVE BENCHMARK SCORES AND RANKINGS

Table 2 and Table 3 list the detailed performance aggregated across 17 datasets for each model, with the former focusing on metrics related to most human-like object recognition behavior and the latter focusing on out-of-distribution accuracy.

Figure 11: OOD accuracy and error consistency across all twelve parametric datasets from Geirhos et al. (2021): (a) colour/greyscale, (b) true/false colour, (c) uniform noise, (d) low-pass, (e) contrast, (f) high-pass, (g) Eidolon I, (h) phase noise, (i) Eidolon II, (j) power equalisation, (k) Eidolon III, (l) rotation. Error consistency results for nonparametric datasets are plotted in Figures 6 and 15 to 18.
[Figure 12, panels (a)-(c): OOD accuracy for each model on (a) sketch images, (b) edge images, and (c) silhouette images; per-model axis labels omitted here.]
[Figure 12, panel (d): OOD accuracy for each model on stylized images; per-model axis labels omitted here.]
Figure 12: OOD accuracy on all four nonparametric datasets (i.e., datasets with only a single corruption type and strength).

[Figure 13: boxplots of shape bias for each model; per-model axis labels omitted here.]
Figure 13: Zero-shot generative classifiers achieve a human-level shape bias: 99% for Imagen, 93% for Stable Diffusion, 92% for Parti and 92-99% for individual human observers (96% on average). This figure shows boxplots highlighting the spread across 16 categories for each model as a different way of visualizing the data from Figure 1.

[Figure 14: fraction of 'shape' decisions versus fraction of 'texture' decisions across the 16 shape categories; both axes range from 0 to 1.]
Figure 14: Shape bias before and after training with diffusion noise: This figure shows how ResNet-50 shape bias increases from 21% for a vanilla model (dark grey circles) to 78% when trained and evaluated with diffusion noise (black circles). Horizontal lines indicate the average shape bias across categories. Other models as in Figure 13.
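The shape bias reported in Figures 13 and 14 follows the cue-conflict definition used by the benchmark: among the trials a classifier answers with either the shape category or the texture category of a cue-conflict image, shape bias is the fraction answered with the shape category. A minimal sketch of this bookkeeping (illustrative names and data, not the toolbox API):

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of 'shape' decisions among all shape-or-texture decisions.

    Each trial i is a cue-conflict image whose shape category is
    shape_labels[i] and whose texture category is texture_labels[i]
    (the two cues are assumed to differ). predictions[i] is the
    classifier's 16-class decision for that image.
    """
    shape_hits = sum(p == s for p, s in zip(predictions, shape_labels))
    texture_hits = sum(p == t for p, t in zip(predictions, texture_labels))
    if shape_hits + texture_hits == 0:
        raise ValueError("No shape or texture decisions to compute a bias from.")
    return shape_hits / (shape_hits + texture_hits)

# Example: 3 shape decisions and 1 texture decision -> shape bias of 0.75.
print(shape_bias(["cat", "dog", "car", "clock"],
                 ["cat", "dog", "car", "boat"],
                 ["elephant", "bird", "clock", "clock"]))
```

Decisions that match neither the shape nor the texture cue do not enter this ratio, which is why shape bias is reported alongside OOD accuracy rather than instead of it.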
[Figure 15: human-to-human and model-to-model error consistency for sketch images; axis labels (all models, Parti, Stable Diffusion, Imagen, and human subjects) omitted here.]
Figure 15: Error consistency for sketch images.
[Figure 16: human-to-human and model-to-model error consistency for stylized images; axis labels (all models, Parti, Stable Diffusion, Imagen, and human subjects) omitted here.]
Figure 16: Error consistency for stylized images.
[Figure 17: human-to-human and model-to-model error consistency for edge images; axis labels (all models, Parti, Stable Diffusion, Imagen, and human subjects) omitted here.]
Figure 17: Error consistency for edge images.
[Figure 18: human-to-human and model-to-model error consistency for silhouette images; axis labels (all models, Parti, Stable Diffusion, Imagen, and human subjects) omitted here.]
Figure 18: Error consistency for silhouette images.

Table 2: Benchmark table of model results for most human-like behaviour, aggregated over all 17 datasets from Geirhos et al. (2021). The three metrics (accuracy difference, observed consistency, and error consistency) each produce a different model ranking. The mean rank of a model across those three metrics is used to rank the models on our benchmark.

model | accuracy diff. | obs. consistency | error consistency | mean rank
ViT-22B_ft-384 (4B) | 0.018 | 0.783 | 0.258 | 2.333
Imagen (860M) | 0.023 | 0.761 | 0.309 | 3.000
ViT-22B_ft-560 (4B) | 0.022 | 0.739 | 0.281 | 4.333
CLIP: ViT-B (400M), 80 prompts | 0.023 | 0.758 | 0.281 | 4.333
Stable Diffusion | 0.023 | 0.743 | 0.264 | 5.000
SWSL: ResNeXt-101 (940M) | 0.028 | 0.752 | 0.237 | 8.000
BiT-M: ResNet-101x1 (14M) | 0.034 | 0.733 | 0.252 | 9.333
BiT-M: ResNet-152x2 (14M) | 0.035 | 0.737 | 0.243 | 10.000
ViT-L | 0.033 | 0.738 | 0.222 | 11.667
BiT-M: ResNet-152x4 (14M) | 0.035 | 0.732 | 0.233 | 12.667
ViT-L (14M) | 0.035 | 0.744 | 0.206 | 14.000
ViT-22B_ft-224 (4B) | 0.030 | 0.781 | 0.197 | 14.000
BiT-M: ResNet-50x3 (14M) | 0.040 | 0.726 | 0.228 | 14.333
BiT-M: ResNet-50x1 (14M) | 0.042 | 0.718 | 0.240 | 14.667
CLIP: ViT-B (400M), 1 prompt | 0.054 | 0.688 | 0.257 | 16.000
SWSL: ResNet-50 (940M) | 0.041 | 0.727 | 0.211 | 16.667
ViT-B | 0.044 | 0.719 | 0.223 | 17.000
BiT-M: ResNet-101x3 (14M) | 0.040 | 0.720 | 0.204 | 19.333
ViT-B (14M) | 0.049 | 0.717 | 0.209 | 20.000
densenet201 | 0.060 | 0.695 | 0.212 | 20.333
Noisy Student: ENet L2 (300M) | 0.040 | 0.764 | 0.169 | 22.333
ViT-S | 0.066 | 0.684 | 0.216 | 22.333
densenet169 | 0.065 | 0.688 | 0.207 | 23.000
inception_v3 | 0.066 | 0.677 | 0.211 | 23.333
ResNet-50 L2 eps 1.0 | 0.079 | 0.669 | 0.224 | 26.667
ResNet-50 L2 eps 3.0 | 0.079 | 0.663 | 0.239 | 27.667
SimCLR: ResNet-50x4 | 0.071 | 0.698 | 0.179 | 30.333
wide_resnet101_2 | 0.068 | 0.676 | 0.187 | 30.333
ResNet-50 L2 eps 0.5 | 0.078 | 0.668 | 0.203 | 31.000
densenet121 | 0.077 | 0.671 | 0.200 | 31.000
SimCLR: ResNet-50x2 | 0.073 | 0.686 | 0.180 | 31.333
resnet152 | 0.077 | 0.675 | 0.190 | 31.667
resnet101 | 0.074 | 0.671 | 0.192 | 31.667
resnext101_32x8d | 0.074 | 0.674 | 0.182 | 32.667
ResNet-50 L2 eps 5.0 | 0.087 | 0.649 | 0.240 | 32.667
resnet50 | 0.087 | 0.665 | 0.208 | 34.333
resnet34 | 0.084 | 0.662 | 0.205 | 35.000
vgg19_bn | 0.081 | 0.660 | 0.200 | 35.667
resnext50_32x4d | 0.079 | 0.666 | 0.184 | 36.333
SimCLR: ResNet-50x1 | 0.080 | 0.667 | 0.179 | 38.000
resnet18 | 0.091 | 0.648 | 0.201 | 40.333
vgg16_bn | 0.088 | 0.651 | 0.198 | 40.333
wide_resnet50_2 | 0.084 | 0.663 | 0.176 | 41.667
MoCoV2: ResNet-50 | 0.083 | 0.660 | 0.177 | 42.000
mobilenet_v2 | 0.092 | 0.645 | 0.196 | 43.000
ResNet-50 L2 eps 0.0 | 0.086 | 0.654 | 0.178 | 43.333
mnasnet1_0 | 0.092 | 0.646 | 0.189 | 44.333
vgg11_bn | 0.106 | 0.635 | 0.193 | 44.667
InfoMin: ResNet-50 | 0.086 | 0.659 | 0.168 | 45.333
vgg13_bn | 0.101 | 0.631 | 0.180 | 47.000
mnasnet0_5 | 0.110 | 0.617 | 0.173 | 51.000
MoCo: ResNet-50 | 0.107 | 0.617 | 0.149 | 53.000
alexnet | 0.118 | 0.597 | 0.165 | 53.333
squeezenet1_1 | 0.131 | 0.593 | 0.175 | 53.667
PIRL: ResNet-50 | 0.119 | 0.607 | 0.141 | 54.667
shufflenet_v2_x0_5 | 0.126 | 0.592 | 0.160 | 55.333
InsDis: ResNet-50 | 0.131 | 0.593 | 0.138 | 56.667
squeezenet1_0 | 0.145 | 0.574 | 0.153 | 57.000

Table 3: Benchmark table of model results for highest out-of-distribution robustness, aggregated over all 17 datasets from Geirhos et al. (2021).
model | OOD accuracy | rank
ViT-22B_ft-224 (4B) | 0.837 | 1
Noisy Student: ENet L2 (300M) | 0.829 | 2
ViT-22B_ft-384 (4B) | 0.798 | 3
ViT-L (14M) | 0.733 | 4
CLIP: ViT-B (400M), 80 prompts | 0.708 | 5
Imagen (860M) | 0.706 | 6
ViT-L | 0.706 | 7
SWSL: ResNeXt-101 (940M) | 0.698 | 8
BiT-M: ResNet-152x2 (14M) | 0.694 | 9
Stable Diffusion | 0.689 | 10
BiT-M: ResNet-152x4 (14M) | 0.688 | 11
BiT-M: ResNet-101x3 (14M) | 0.682 | 12
BiT-M: ResNet-50x3 (14M) | 0.679 | 13
SimCLR: ResNet-50x4 | 0.677 | 14
SWSL: ResNet-50 (940M) | 0.677 | 15
BiT-M: ResNet-101x1 (14M) | 0.672 | 16
ViT-B (14M) | 0.669 | 17
ViT-B | 0.658 | 18
BiT-M: ResNet-50x1 (14M) | 0.654 | 19
SimCLR: ResNet-50x2 | 0.644 | 20
ViT-22B_ft-560 (4B) | 0.639 | 21
densenet201 | 0.621 | 22
densenet169 | 0.613 | 23
SimCLR: ResNet-50x1 | 0.596 | 24
resnext101_32x8d | 0.594 | 25
resnet152 | 0.584 | 26
wide_resnet101_2 | 0.583 | 27
resnet101 | 0.583 | 28
ViT-S | 0.579 | 29
densenet121 | 0.576 | 30
MoCoV2: ResNet-50 | 0.571 | 31
inception_v3 | 0.571 | 32
InfoMin: ResNet-50 | 0.571 | 33
resnext50_32x4d | 0.569 | 34
wide_resnet50_2 | 0.566 | 35
resnet50 | 0.559 | 36
resnet34 | 0.553 | 37
ResNet-50 L2 eps 0.5 | 0.551 | 38
CLIP: ViT-B (400M), 1 prompt | 0.550 | 39
ResNet-50 L2 eps 1.0 | 0.547 | 40
vgg19_bn | 0.546 | 41
ResNet-50 L2 eps 0.0 | 0.545 | 42
ResNet-50 L2 eps 3.0 | 0.530 | 43
vgg16_bn | 0.530 | 44
mnasnet1_0 | 0.524 | 45
resnet18 | 0.521 | 46
mobilenet_v2 | 0.520 | 47
MoCo: ResNet-50 | 0.502 | 48
ResNet-50 L2 eps 5.0 | 0.501 | 49
vgg13_bn | 0.499 | 50
vgg11_bn | 0.498 | 51
PIRL: ResNet-50 | 0.489 | 52
mnasnet0_5 | 0.472 | 53
InsDis: ResNet-50 | 0.468 | 54
shufflenet_v2_x0_5 | 0.440 | 55
alexnet | 0.434 | 56
squeezenet1_1 | 0.425 | 57
squeezenet1_0 | 0.401 | 58
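For reference, the ordering in Table 2 follows the rank-aggregation scheme described in its caption: each of the three metrics yields its own ranking (accuracy difference ascending, observed and error consistency descending, cf. Figure 10), and models are sorted by the mean of those three ranks. A minimal pandas sketch on a three-model excerpt of Table 2; the resulting mean ranks differ from the table because the real ranking is computed over all 58 models, and the toolbox's tie handling may differ:

```python
import pandas as pd

# Three-model excerpt of Table 2 (accuracy difference, observed
# consistency, error consistency).
df = pd.DataFrame(
    {
        "accuracy_difference": [0.018, 0.023, 0.023],   # lower = better
        "observed_consistency": [0.783, 0.761, 0.743],  # higher = better
        "error_consistency": [0.258, 0.309, 0.264],     # higher = better
    },
    index=["ViT-22B_ft-384 (4B)", "Imagen (860M)", "Stable Diffusion"],
)

# Rank each metric on its own (rank 1 = best), then average the ranks.
ranks = pd.DataFrame(
    {
        "accuracy_difference": df["accuracy_difference"].rank(ascending=True),
        "observed_consistency": df["observed_consistency"].rank(ascending=False),
        "error_consistency": df["error_consistency"].rank(ascending=False),
    }
)
df["mean_rank"] = ranks.mean(axis=1)
print(df.sort_values("mean_rank"))
```

On this excerpt the relative ordering (ViT-22B_ft-384, then Imagen, then Stable Diffusion) matches the top of Table 2, even though the absolute mean-rank values do not.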