# texttoimage_generation_via_energybased_clip__93b44868.pdf

Published in Transactions on Machine Learning Research (07/2025)

Text-to-Image Generation Via Energy-Based CLIP

Roy Ganz royg27592@gmail.com Electrical Engineering Department Technion

Michael Elad elad@cs.technion.ac.il Computer Science Department Technion

Reviewed on OpenReview: https://openreview.net/forum?id=FBmWiJXIGk

Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of CLIP-JEM by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that our model can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.

1 Introduction

Figure 1: CLIP-JEM gradients. Demonstration of the meaningful input gradients of CLIP-JEM compared to a vanilla CLIP model with respect to different textual prompts (shown: "An Egyptian cat" and "A corgi dog").

Energy-based models (EBMs) (LeCun et al., 2006) are a class of models that define a probability distribution over data points using an energy function, where lower energy values correspond to higher probabilities. These models are trained by adjusting the learned function to minimize the energy of observed data points and maximize the energy of synthetic ones, effectively aligning the energy landscape with the true data distribution. Joint Energy Models (JEMs) (Grathwohl et al., 2019) extend EBMs by utilizing a classifier's logits to also model a joint energy function. JEMs are trained with both discriminative and generative objectives, namely, aiming to classify data points and to model the joint energy function, respectively. However, both EBMs and JEMs face significant scalability challenges. Their training processes can be unstable and computationally intensive, restricting their applicability to smaller datasets and making them unsuitable for real-world, high-resolution image datasets.


Adversarial training (Goodfellow et al., 2015; Madry et al., 2018) is a technique designed to enhance models' robustness against adversarial examples, which are small, imperceptible perturbations added to inputs to mislead classifiers. Training models to correctly classify these adversarial examples results in perceptually aligned gradients (PAG) (Tsipras et al., 2019). PAG refers to the phenomenon where the model's input gradients¹ are semantically meaningful and aligned with human perception, indicating that the features learned by the model are more human-aligned. Recently, the concept of PAG was extended to the multimodal image-text domain with CLIPAG (Ganz & Elad, 2024), which applies adversarial training to the visual part of CLIP (Radford et al., 2021). This approach enables text-based generation through pixel-space optimization. However, while CLIPAG can produce good-looking images, the results are often non-realistic and heavily reliant on multiview augmentation. The need for such augmentation suggests that the gradients themselves are not sufficiently informative to generate realistic images from single views. These limitations highlight the need for more advanced techniques to achieve realistic generation using CLIP models.

In this work, we propose CLIP-JEM, a novel approach that extends JEMs to the multimodal vision-language domain using CLIP by combining it with adversarial training. This combination leverages the strengths of both techniques to address their limitations: mitigating the scalability and stability issues of JEMs and enabling high-resolution text-based generation while overcoming the non-realistic outputs typical of CLIPAG. Inspired by unimodal JEMs, we fine-tune CLIP using two objectives: generative and discriminative. For the generative objective, we introduce an image-text energy function based on cosine similarity in the CLIP space. We train CLIP to assign low energy values to real image-text pairs and high values to others. More specifically, inspired by EBMs, we use the model to draw text-based generated samples and train it to assign them high energy values. Sampling is done by an iterative pixel-space optimization that follows the model's gradients, starting from a random sample. We formulate this as a contrastive loss to align with CLIP. For the discriminative objective, we follow the path set by CLIPAG to define a contrastive adversarial loss. By combining these two objectives, we train the visual encoder of CLIP, resulting in CLIP-JEM, a model with semantically meaningful gradients (fig. 1) capable of generating realistic samples through simple pixel-space optimization (fig. 2).

We establish the effectiveness of CLIP-JEM across three key domains: text-to-image generation, guidance capabilities, and use as an evaluation metric. First, CLIP-JEM enables text-to-image generation through pixel-space optimization, producing realistic images and surpassing CLIPAG by more than 20 FID points without applying any augmentation. Despite its relatively small size, this model achieves results competitive with much larger models on the challenging compositionality benchmark, T2I-CompBench (Huang et al., 2023). Specifically, it surpasses Stable Diffusion v2 (Rombach et al., 2022a) and methods tailored for compositionality (Liu et al., 2023b; Feng et al., 2023). We attribute this to the discriminative nature of the model, enabling it to better align with the provided prompts. Furthermore, CLIP-JEM significantly enhances text-based guiding capabilities. To this end, we illustrate that incorporating CLIP-JEM for guidance converts unconditional diffusion models (Dhariwal & Nichol, 2021; Ahn et al., 2024) into text-guided ones with just 25 diffusion steps. Additionally, replacing CLIP with CLIP-JEM in CLIP-based generative frameworks markedly boosts their performance. Finally, CLIP-JEM proves its utility as an evaluation metric (a.k.a. CLIP-Score) for text-based image editing. It shows robustness to adversarial examples and enhanced sensitivity to image quality compared to the vanilla counterpart. This indicates that CLIP-JEM is a more reliable and precise tool for assessing the quality and integrity of generated images. To summarize, our contributions are as follows.

- We introduce CLIP-JEM, a novel approach extending Joint Energy Models to the vision-language domain using CLIP.
- CLIP-JEM enables high-resolution text-to-image generation through pixel-space optimization, achieving competitive results on a challenging compositionality benchmark.
- CLIP-JEM enhances text-based guidance capabilities, boosting CLIP-based generative frameworks and converting unconditional diffusion models into text-guided ones.
- We demonstrate that CLIP-JEM can serve as an improved CLIP-Score evaluation metric for text-based image editing.

¹ This gradient is computed as the derivative of the chosen output logit w.r.t. the input image.


Figure 2: CLIP-JEM qualitative results. Images generated using CLIP-JEM with ConvNext-XXL. Prompts include "A small marina with boats docked there", "A beautiful bird is sitting on a branch.", "a beautiful boat is docked in a bay.", "A banana tree in a forest of trees.", and "A street sign for the street named Crapo."

2 Related Work

2.1 Energy-Based Models

EBMs (LeCun et al., 2006) define a probability distribution over data points using an energy function:

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \tag{1}$$

where Eθ(x) assigns scalar values to data points, and Z(θ) is a normalizing constant. Lower energy values indicate higher probabilities. Training a model with parameters θ involves minimizing the energy assigned to positive data samples x⁺ ∼ p(x) and maximizing the energy of negative samples x⁻ ∼ pθ(x). In this framework, the model's ability to distinguish between samples is operationalized by assigning lower energy (i.e., higher likelihood) to samples drawn from the data distribution, and higher energy (i.e., lower likelihood) to those drawn from the model distribution. Convergence is achieved when the model can no longer reliably separate positive and negative samples based on their energy values. To sample from pθ(·), we employ Stochastic Gradient Langevin Dynamics (SGLD), which begins from a predefined initial distribution and iteratively updates the samples using a step size α:

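$$x_{i+1} = x_i - \frac{\alpha}{2} \frac{\partial E_\theta(x_i)}{\partial x_i} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \alpha I). \tag{2}$$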
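As a rough illustration of the SGLD update in eq. (2), the sketch below samples from a toy quadratic energy E(x) = 0.5‖x − μ‖², whose Gibbs distribution is a unit Gaussian centred at μ. The toy energy, the helper names (`energy_grad`, `sgld_sample`), the step size, and the chain counts are all assumptions chosen for illustration, and the drift term uses the standard Langevin factor of α/2 paired with noise of variance α; none of this is the paper's training code.

```python
import numpy as np

def energy_grad(x, mu):
    """Gradient of the toy energy E(x) = 0.5 * ||x - mu||^2 (assumed for illustration)."""
    return x - mu

def sgld_sample(mu, n_chains=2000, n_steps=500, alpha=0.1, seed=0):
    """Run independent SGLD chains: x <- x - (alpha / 2) * dE/dx + eps, eps ~ N(0, alpha * I)."""
    rng = np.random.default_rng(seed)
    # Initial distribution: uniform, mirroring the "predefined initial distribution" in the text.
    x = rng.uniform(-1.0, 1.0, size=(n_chains, mu.shape[0]))
    for _ in range(n_steps):
        noise = rng.normal(0.0, np.sqrt(alpha), size=x.shape)  # std = sqrt(alpha)
        x = x - 0.5 * alpha * energy_grad(x, mu) + noise
    return x

mu = np.array([2.0, -1.0])
samples = sgld_sample(mu)
print("empirical mean:", samples.mean(axis=0))
```

With a real EBM the gradient would come from automatic differentiation of the learned energy rather than a closed form.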

2.2 Joint-Energy Models

Recently, Grathwohl et al. (2019) observed that one can parameterize pθ(x, y) and pθ(x) using the logits of a classifier. Given a label y and an image x, the joint distribution can be expressed as

$$p_\theta(x, y) = \frac{\exp(f_\theta(x)[y])}{Z(\theta)}, \tag{3}$$

where fθ(x)[y] is the logit corresponding to the y-th class label. Thus, the joint energy function is Eθ(x, y) = −fθ(x)[y]. Marginalizing over y results in an unconditional distribution pθ(·). Training JEMs involves optimizing both a discriminative and a generative objective. Despite having several merits (e.g., generative capabilities and adversarial robustness), training such models often suffers from instability and even divergence. Despite recent advancements, JEMs perform well on relatively small datasets (mainly SVHN (Netzer et al., 2011), CIFAR (Krizhevsky et al., 2009), and CelebA (Liu et al., 2015)) but are not competitive when brought to real-world visual content (Yang et al., 2023; Zhu et al., 2021; Yang & Ji, 2021). In this work, we aim to extend JEMs to the most challenging setup, text-to-image generation, using a CLIP-based model.
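To make the JEM parameterization concrete, the following minimal sketch derives the joint energy, the marginal energy, and the class posterior from a single logit vector. It follows the formulation of Grathwohl et al. (2019), in which the joint energy is the negative logit and marginalizing over y gives a negative log-sum-exp; the function names and the example logits are hypothetical.

```python
import numpy as np

def joint_energy(logits, y):
    """E(x, y) = -f(x)[y]: the negative logit of class y."""
    return -logits[y]

def marginal_energy(logits):
    """E(x) = -log sum_y exp(f(x)[y]), computed with the max-shift trick for stability."""
    m = logits.max()
    return -(m + np.log(np.exp(logits - m).sum()))

def class_posterior(logits):
    """p(y | x) = softmax(f(x)): the usual discriminative classifier."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 0.5, -1.0])  # hypothetical classifier output f(x)
for y in range(len(logits)):
    # p(y | x) = exp(-E(x, y)) / exp(-E(x)); the unknown Z(theta) cancels in the ratio.
    ratio = np.exp(-joint_energy(logits, y) + marginal_energy(logits))
    print(y, ratio, class_posterior(logits)[y])
```

The two printed columns coincide, which is why the same logits can serve both the discriminative and the generative objective.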

2.3 CLIP for Text-to-Image Generation

CLIP (Radford et al., 2021) is a vision-language model, pretrained to align a massive corpus of image-text pairs. The outstanding performance of CLIP's visual and textual encoders has propelled great advancements in various fields. In Large Vision Language Modeling (Li et al., 2022; Zhu et al., 2023; Liu et al., 2023a; Li et al., 2023; Ganz et al., 2023; 2024), CLIP's vision encoder serves as the primary visual backbone, leading to unprecedented performance. In text-to-image generation, two main lines of work harness CLIP: (i) utilizing CLIP's image-text alignment to guide the visual results to be aligned with the textual description (Frans et al., 2022; Crowson et al., 2022; Patashnik et al., 2021; Gal et al., 2022; Kwon & Ye, 2022; Vinker et al., 2022); and (ii) using CLIP's text encoder to condition generative models (Kang et al., 2023a; Nichol et al., 2022; Ramesh et al., 2022; Rombach et al., 2022b). Unlike these works, which utilize CLIP along with a generative model, we aim to cast CLIP into an energy-based model, capable of performing text-to-image generation without an additional generative model.

2.4 Perceptually Aligned Gradients

Adversarially robust models (Carlini & Wagner, 2017; Madry et al., 2018) are designed to withstand adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015), which are small, imperceptible perturbations aimed at misleading classifiers. It has been observed that such models exhibit a phenomenon known as Perceptually Aligned Gradients (PAG), which is absent in their non-robust counterparts. PAG refers to the model's input gradients being semantically meaningful and aligned with human perception, indicating that the features learned are more aligned with human vision (Ilyas et al., 2019; Engstrom et al., 2019; Salman et al., 2020). PAG has been harnessed for generative tasks, such as image generation and image-to-image translation (Santurkar et al., 2019), thereby improving state-of-the-art image synthesis results (Ganz & Elad, 2022). PAG has also been explored for enhanced robust classification (Blau et al., 2023). Recently, the study of PAG has been extended to the multimodal domain using CLIPAG (Ganz & Elad, 2024), which applies adversarial training to the multimodal text-to-image domain using CLIP (Radford et al., 2021). This approach enables text-based generation through pixel-space optimization. However, while CLIPAG can produce good-looking images, the results are often non-realistic and heavily reliant on multiview augmentation. The need for such augmentation suggests that the gradients themselves are not sufficiently informative to generate realistic images from single views. These limitations highlight the need for more advanced techniques for realistic text-to-image generation using CLIP models.

3 Method

In this work, we aim to extend Joint Energy Models to the challenging text-to-image generation setting using a CLIP model. Our overall framework, illustrated in Figure 3, consists of two main objectives: a contrastive energy loss and a contrastive adversarial loss. In section 3.1, we first define the joint image-text energy function, forming a measure of the faithfulness and alignment of the given pair, and elaborate on its training procedure. Next, in section 3.2, we detail the contrastive adversarial loss, extending CLIP's loss to the adversarial case. Lastly, in section 3.3, we describe the overall training procedure of CLIP-JEM, resulting in a multimodal energy model capable of text-to-image generation.

3.1 Joint Image-Text Energy Via CLIP

We extend a pretrained CLIP to model the joint energy of image-text pairs. We denote its vision and textual towers as $f^I_\theta$ and $f^T_\theta$, respectively. Given these notations, we propose to utilize CLIP to formulate an image-text joint distribution,

$$p_\theta(I, T) = \frac{\exp\left(\text{CosineSimilarity}\left(f^I_\theta(I), f^T_\theta(T)\right)\right)}{Z(\theta)}, \tag{4}$$

where I and T are visual and textual inputs, CosineSimilarity is the well-known cosine similarity measure used by CLIP, and Z(θ) is an unknown normalizing factor. Thus, the induced joint image-text energy function is $E_\theta(I, T) = -\text{CosineSimilarity}(f^I_\theta(I), f^T_\theta(T))$, where a higher degree of image-text similarity results in lower energy values and higher probability, and vice versa.

Figure 3: CLIP-JEM method. Illustration of the contrastive losses in our method (the contrastive energy loss and the contrastive adversarial loss) and their effect on enabling realistic text-based image synthesis via pixel-space optimization. The right-hand panel places image-text pairs according to their CLIP score and perceptual quality.

Similar to energy-based model training, adapting CLIP to form an energy measure requires both positive and negative image-text pairs, denoted as (I, T) and (Ĩ, T), respectively. As CLIP is a contrastively trained model, we utilize such pairs to formulate a contrastive energy objective, illustrated on the upper-left side of Figure 3. This rectangular matrix contains the negative joint-energy values of textual and visual inputs. Specifically, the upper part contains the energy of the positive image inputs and the lower part that of the negatives. Given this contrastive energy matrix, we train the model weights of $f^I_\theta(\cdot)$ using the cross-entropy loss with the main diagonal, marked in blue, as the ground-truth annotations. Namely, the objective is to obtain high values on this main diagonal (low energy, high probability) and low values elsewhere. Focusing on a certain textual input Tk, minimizing the proposed contrastive energy loss results in a high cosine similarity for the positive pair (Ik, Tk) and a low one for the negative pair (Ĩk, Tk). This loss has two additional advantages. First, minimizing it yields low cosine similarities, and hence high energy and low probability, for mismatched image-text pairs (Ii, Tj) with i ≠ j, enhancing the model's discriminative ability. Second, it enables the use of several different negative samples per text prompt, which we empirically find to stabilize the training.
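One way to read the contrastive energy matrix described above is sketched below: positive and negative image embeddings are stacked into a (2N × N) similarity matrix against the N captions, and a cross-entropy is taken over the image axis for each caption with the main diagonal as the target. The softmax direction, the temperature value, and the function names are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def l2_normalize(v):
    """Scale each row to unit length so that dot products equal cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def contrastive_energy_loss(pos_img, neg_img, txt, temperature=0.07):
    """pos_img, neg_img, txt: (N, d) embeddings; row k of each belongs to caption k.

    The temperature and the per-caption softmax over images are illustrative assumptions.
    """
    imgs = l2_normalize(np.concatenate([pos_img, neg_img], axis=0))  # (2N, d)
    txt = l2_normalize(txt)                                          # (N, d)
    sim = imgs @ txt.T / temperature   # (2N, N) scaled negative energies
    # Log-softmax over all 2N images for each caption (column), with a max shift for stability.
    sim = sim - sim.max(axis=0, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))
    n = txt.shape[0]
    # Ground truth is the main diagonal: positive image k paired with caption k.
    return -np.mean(log_prob[np.arange(n), np.arange(n)])

rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 16))
loss_aligned = contrastive_energy_loss(txt, -txt, txt)   # positives match their captions
loss_swapped = contrastive_energy_loss(-txt, txt, txt)   # negatives match instead
print(loss_aligned, loss_swapped)
```

The loss is small when the positives align with their captions and the negatives do not, and large when the roles are swapped, which is the behaviour the objective is meant to reward.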

The remaining question is how to obtain the negative samples. Similar to the methodology in the JEM line of work, we craft such samples in an iterative process based on the model's gradients, starting from a simple canonical distribution (e.g., a uniform distribution). Current JEM works utilize SGLD (Welling & Teh, 2011), which requires hundreds of iterations to obtain good samples. To mitigate the computational overhead involved in generating such samples, these works utilize a replay buffer. This data structure stores thousands of generated negative samples and updates them throughout the training process, enabling a relatively small number of SGLD steps. However, in the image-text context, employing such a solution is not feasible: unlike multiclass datasets such as CIFAR-10, which have a fixed number of classes, the text-to-image setting is open-vocabulary in nature, with a practically infinite number of possible captions. Thus, in our setting, maintaining a replay buffer and updating its samples is not a practical solution, as the probability of encountering two identical captions in a dataset is very low, eliminating the usefulness of such a structure. To enable fast negative sampling, we propose to avoid SGLD and utilize a momentum-based optimization with an adaptive learning rate instead. In practice, we use the AdamW optimizer (Loshchilov & Hutter, 2017) to update the image in the iterative process, which enables drawing samples within a relatively small number of steps. Overall, we first draw the initial sample, $I_{t=0}$, from a canonical uniform distribution. Next, we compute the cosine similarity between the visual and textual encodings and calculate its input gradients. Lastly, we update the visual inputs accordingly and repeat the process for $T_{JEM}$ iterations. Notably, we calculate the gradients on a slightly noisy version of the visual input, introducing randomness to the sampling mechanism. In the supplementary materials, we discuss the connection of our sampling procedure to SGLD.
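The sampling loop described above can be sketched on a toy problem. Here a random linear map `W` stands in for the vision encoder and a random unit vector `t` for the text embedding, so the cosine-similarity gradient is available in closed form; the Adam-style update omits weight decay, and the learning rate, noise scale, step count, and clipping to [0, 1] are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def cos_sim_and_grad(x, W, t):
    """Cosine similarity between z = W @ x and the unit vector t, plus its gradient w.r.t. x."""
    z = W @ x
    nz = np.linalg.norm(z)
    sim = float(z @ t) / nz
    dz = (t - sim * z / nz) / nz          # d(sim)/dz for a unit-norm t
    return sim, W.T @ dz

def draw_negative(W, t, steps=50, lr=0.02, sigma=0.01, seed=0,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Start from uniform noise and ascend the similarity with Adam-style updates."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 1.0, size=W.shape[1])   # canonical uniform initial sample
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, steps + 1):
        x_noisy = x + sigma * rng.normal(size=x.shape)   # gradient taken on a noisy input
        _, g = cos_sim_and_grad(x_noisy, W, t)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        step = lr * (m / (1 - b1 ** k)) / (np.sqrt(v / (1 - b2 ** k)) + eps)
        x = np.clip(x + step, 0.0, 1.0)           # ascent step, kept in a valid pixel range
    return x0, x

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 64))                      # hypothetical stand-in for the vision encoder
t = rng.normal(size=8)
t /= np.linalg.norm(t)                            # hypothetical unit-norm text embedding
x_init, x_neg = draw_negative(W, t)
sim_init, _ = cos_sim_and_grad(x_init, W, t)
sim_neg, _ = cos_sim_and_grad(x_neg, W, t)
print(sim_init, sim_neg)
```

In the actual method the gradient would be obtained by backpropagating through CLIP's vision tower, and the resulting image would serve as a negative for the contrastive energy loss.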

3.2 Contrastive Adversarial Training

Similar to CLIPAG (Ganz & Elad, 2024), we extend adversarial training to the contrastive loss, forming a contrastive adversarial loss. Given an input batch of image-text pairs, we perform a two-step procedure: (i) crafting visual adversarial examples that minimize the similarity with respect to the matching texts (i.e., maximize the loss); and (ii) optimizing the vision encoder's weights to minimize the contrastive adversarial loss. The adversarial contrastive loss matrix is illustrated on the lower-left side of Figure 3, and in eq. (5). In this matrix, the vertical green vector represents the encodings of the adversarial visual inputs, while the horizontal pink vector represents the textual encodings. This approach ensures that the model learns to be robust against adversarial perturbations and equips it with semantically meaningful gradients (i.e., PAG), which enhance the iterative generation process.

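$$\min_{\theta_I} \; \mathbb{E}_{(x,t)\sim D} \; \max_{\delta} \left(1 - \text{CosineSimilarity}\left(f^I_{\theta_I}(x + \delta), f^T_{\theta_T}(t)\right)\right). \tag{5}$$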
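The inner maximization of eq. (5) can be approximated with projected gradient steps. The sketch below uses a toy linear encoder `W` and a sign-gradient (PGD-style) attack inside an L∞ ball; the choice of PGD, the L∞ threat model, the budget of 2/255 (borrowed from the evaluation attack in Section 4.3), and the step schedule are assumptions for illustration, since the training-time attack settings are not given here.

```python
import numpy as np

def cos_sim_and_grad(x, W, t):
    """Cosine similarity between z = W @ x and the unit vector t, plus its gradient w.r.t. x."""
    z = W @ x
    nz = np.linalg.norm(z)
    sim = float(z @ t) / nz
    dz = (t - sim * z / nz) / nz
    return sim, W.T @ dz

def pgd_attack(x, W, t, eps=2 / 255, n_steps=10):
    """Maximize 1 - cos_sim over perturbations with ||delta||_inf <= eps (assumed threat model)."""
    step = eps / 4
    x_adv = x.copy()
    for _ in range(n_steps):
        _, g = cos_sim_and_grad(x_adv, W, t)
        x_adv = x_adv - step * np.sign(g)            # ascending 1 - cos equals descending cos
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project back into the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)             # stay in a valid pixel range
    return x_adv

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 64))                         # hypothetical stand-in for the vision encoder
t = rng.normal(size=8)
t /= np.linalg.norm(t)                               # hypothetical unit-norm text embedding
x = rng.uniform(0.0, 1.0, size=64)
x_adv = pgd_attack(x, W, t)
sim_clean, _ = cos_sim_and_grad(x, W, t)
sim_adv, _ = cos_sim_and_grad(x_adv, W, t)
print(sim_clean, sim_adv)
```

The outer minimization would then update the vision encoder so that such perturbed images remain close to their captions.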

3.3 Training Procedure

CLIP-JEM's objective is the combination of the contrastive adversarial and contrastive energy losses, and the overall training protocol is described in the supplementary materials (algorithm 1). The goal of CLIP-JEM is to obtain an image-text energy model capable of generating images based on textual descriptions. On the right side of Figure 3, we illustrate this generation process (piece-wise linear arrow), starting from a random sample from a canonical distribution (marked in gray). In this figure, any image-text pair is represented by two values: CLIP score and perceptual quality. The former measures the alignment of the pair according to our CLIP model, while the latter is a conceptual measure of the image's visual quality. We aim to generate samples with high CLIP scores (low energy) and good perceptual quality. We analyze the contribution of both objectives to this goal. The contrastive energy loss prevents CLIP-JEM from assigning high CLIP scores to low-quality images, as throughout the energy training, the model is trained to lower the CLIP scores of negative samples and provide high scores to the positive ones, which have high perceptual quality. The contrastive adversarial objective trains the model against adversarial inputs, namely, imperceptible visual changes that significantly decrease the CLIP score (these samples reside in the bottom-right corner of the figure's center part). Thus, this loss prevents CLIP-JEM from assigning low CLIP scores to high-quality inputs. Overall, combining the two objectives eliminates the red modes in the upper-right panel of Figure 3, preventing the iterative generation from drawing such samples. This, in turn, leads to drawing samples of the green mode, which are realistic-looking images that align with their corresponding text. We demonstrate the contribution of combining these two objectives in Appendix D.

4 Experiments

We train different variants of CLIP using Algorithm 1, including ViT-B/32 and ConvNext in base, large, and XXL configurations, on the extensive image-caption DataComp dataset (Gadre et al., 2024) for 20,000 steps. Throughout the training process, we keep the text encoder frozen and update solely the vision encoder. Implementation and training details are provided in the supplementary materials. To analyze the performance of CLIP-JEM, we first evaluate it in the text-to-image generation setting (Section 4.1). Next, we demonstrate its effectiveness as a guiding model (Section 4.2). Lastly, we show that CLIP-JEM can serve as an improved evaluation metric compared to the vanilla CLIP, attributed to its robustness and awareness of perceptual quality (Section 4.3).

4.1 Text-To-Image Generation

Similar to the training procedure, we perform pixel-space optimization to generate samples. Given a target prompt, we initialize the image as a random sample from a uniform distribution and perform 50 steps to maximize the cosine similarity with respect to the text, using an AdamW optimizer with no momentum. We evaluate the performance of CLIP-JEM in two main setups: image quality and compositionality.

Table 1: MS-COCO text-to-image generation results. Fréchet Inception Distance (FID, lower is better) and CLIPSIM (higher is better) results, along with model sizes. ZS indicates whether the model was trained on MS-COCO.

| Method | #Params. | ZS | FID | CLIPSIM |
|---|---|---|---|---|
| Stack-GAN | - | | 74.1 | - |
| AttnGAN | 230M | | 35.5 | 27.7 |
| CogView | 4,000M | | 27.1 | 33.2 |
| DALL-E | 12,000M | | 27.5 | - |
| GLIDE | 6,000M | | 12.2 | - |
| LDM-KL-8 | 1,450M | | 23.3 | - |
| LDM-KL-8-G | 1,450M | | 12.6 | - |
| LAFITE | 226M | | 26.9 | - |
| StyleGAN-T | 1,100M | | 13.9 | - |
| NÜWA | 870M | | 12.9 | 34.3 |
| GigaGAN | 1,034M | | 9.1 | - |
| CLIPAG (L), without multiview augmentations | 200M | | 82.0 | 30.3 |
| CLIPAG (L), with multiview augmentations | 200M | | 47.6 | 33.4 |
| CLIP-JEM (ViT) | 88M | | 68.3 | 34.5 |
| CLIP-JEM (B) | 88M | | 34.8 | 31.6 |
| CLIP-JEM (L) | 200M | | 26.7 | 31.7 |
| CLIP-JEM (XXL) | 846M | | 23.4 | 33.5 |

Quality and Fidelity We use CLIP-JEM to generate 30,000 samples from the MS-COCO dataset (Lin et al., 2015) and report the results in FID² and CLIPSIM using ViT-B/32 in Table 1. We compare CLIP-JEM to various GAN-based, diffusion-based, and autoregressive text-to-image models (see more details in the supplementary materials). We report the number of parameters of the baselines and the size of the vision encoder used for CLIP-JEM. As shown in Table 1, scaling up the model size significantly benefits our method, substantially improving the FID scores. Specifically, in the XXL case, CLIP-JEM performs similarly to the unguided Latent Diffusion Model (LDM-KL-8) and outperforms DALL-E and CogView despite being smaller. Additionally, we train a CLIPAG baseline using the same training configuration and architecture to better demonstrate the effectiveness of our image-text energy objective. We report the results of two CLIPAG variants using ConvNext Large, with and without multiview augmentations. As the CLIPAG results show, the multiview augmentation pipeline is crucial (it improves the FID from 82.0 to 47.6), highlighting the unsatisfying quality of its gradients. Interestingly, using the same model with CLIP-JEM leads to a much-improved FID score (26.7) without applying any augmentation. This strongly indicates the effectiveness of introducing our contrastive energy loss and our method's improved capability of generating realistic samples. In the CLIPSIM metric, CLIPAG with multiview augmentations leads to a better result than CLIP-JEM at the same model size (33.4 vs. 31.7). We attribute this to the fact that the multiview augmentations lead to unrealistic images (which impair the FID) that highly align with the text (increasing the CLIPSIM). Overall, these results indicate that CLIP-JEM achieves both of its goals: extending JEM training to text-to-image generation and improving CLIPAG's photorealism.

² We use the same evaluation code as DM-GAN, which is available at https://github.com/MinfengZhu/DM-GAN

Compositionality We compare CLIP-JEM with other generative models using T2I-CompBench (Huang et al., 2023), which evaluates open-world compositional text-to-image generation across attribute binding (color, shape, and texture), object relationships (spatial and non-spatial), and complex compositions. Despite advances in text-to-image generation, models still struggle to compose objects with different characteristics and relationships into a coherent image. Following the T2I-CompBench procedure, we generate 10 samples per prompt and average the results on the validation sets of each category, using the same evaluation metrics as in the original paper (B-VQA, UniDet, CLIP, and 3-in-1 for the attribute binding, spatial, non-spatial, and complex categories, respectively). Table 2 reports CLIP-JEM's results compared to top-performing models, including an average score across the six categories. CLIP-JEM excels in attribute binding, associating attributes with corresponding objects in generated images, and performs well in the non-spatial relationship category (e.g., "speak to" and "look at"). However, it scores lower in the spatial relationship category due to CLIP's known limitations in spatial compositionality (Yuksekgonul et al., 2022; Lewis et al., 2022). In the complex category, which involves multiple objects and attributes, CLIP-JEM performs well even though the prompts contain spatial relationships, owing to its strengths in attribute binding and non-spatial understanding. Overall, CLIP-JEM outperforms most baselines in compositional generation, despite being smaller. Specifically, it surpasses methods that are deliberately designed to tackle compositionality and rely on a much stronger generative model (Stable Diffusion v2). We attribute this success to the combination of generative and discriminative objectives, enabling effective alignment with compositional prompts.

Table 2: T2I-CompBench results. CLIP-JEM performance on the text-to-image generation compositionality benchmark. Color, Shape, and Texture are the attribute-binding categories; Spatial and Non-spatial are the object-relationship categories.

| Model | Color | Shape | Texture | Spatial | Non-spatial | Complex | Average |
|---|---|---|---|---|---|---|---|
| SD1.4 | 0.3765 | 0.3576 | 0.4156 | 0.1246 | 0.3079 | 0.3080 | 0.3150 |
| SD2 | 0.5065 | 0.4221 | 0.4922 | 0.1342 | 0.3127 | 0.3386 | 0.3677 |
| Composable (SD2) | 0.4063 | 0.3299 | 0.3644 | 0.0800 | 0.2980 | 0.2898 | 0.2947 |
| Structured (SD2) | 0.4990 | 0.4218 | 0.4900 | 0.1386 | 0.3111 | 0.3355 | 0.3660 |
| Attn-Exct (SD2) | 0.6400 | 0.4517 | 0.5963 | 0.1455 | 0.3109 | 0.3401 | 0.4141 |
| CLIP-JEM (ViT) | 0.5305 | 0.5159 | 0.5566 | 0.0262 | 0.3343 | 0.2900 | 0.3756 |
| CLIP-JEM (B) | 0.5799 | 0.5122 | 0.6154 | 0.0708 | 0.3145 | 0.2938 | 0.3978 |
| CLIP-JEM (L) | 0.5715 | 0.5202 | 0.6072 | 0.0768 | 0.3152 | 0.3020 | 0.3988 |
| CLIP-JEM (XXL) | 0.5670 | 0.5021 | 0.6132 | 0.0841 | 0.3205 | 0.3129 | 0.4000 |
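The Average column of Table 2 is the unweighted mean of the six category scores. The short check below reproduces it for three rows copied from the table; the helper name `category_average` is hypothetical.

```python
# Six category scores (color, shape, texture, spatial, non-spatial, complex)
# and the reported average, copied from Table 2.
rows = {
    "SD2": ([0.5065, 0.4221, 0.4922, 0.1342, 0.3127, 0.3386], 0.3677),
    "Attn-Exct (SD2)": ([0.6400, 0.4517, 0.5963, 0.1455, 0.3109, 0.3401], 0.4141),
    "CLIP-JEM (XXL)": ([0.5670, 0.5021, 0.6132, 0.0841, 0.3205, 0.3129], 0.4000),
}

def category_average(scores):
    """Unweighted mean over the six T2I-CompBench categories."""
    return sum(scores) / len(scores)

for name, (scores, reported) in rows.items():
    print(f"{name}: computed {category_average(scores):.4f}, reported {reported:.4f}")
```

Because the average weights all categories equally, the low spatial score of CLIP-JEM pulls its average down noticeably even though it leads in shape and texture binding.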

4.2 Text Guidance Using CLIP-JEM

Figure 4: Diffusion guidance using CLIP-JEM. Converting an unconditional diffusion model into a text-based one with CLIP-JEM. In each row, we plot the unconditional sample alongside the text-conditioned samples obtained with guidance, using the same seed. Prompts include "A cow on the beach", "A cosy living room with a fireplace", "A barn surrounded by a green field", "A golden bird", and "A beach in Thailand".

As shown in fig. 1, CLIP-JEM possesses semantically meaningful gradients with respect to a given text. In this section, we demonstrate the guidance capability of our method in two main settings: diffusion guidance and improving CLIP-based generative frameworks, utilizing a ConvNext Large model.

Diffusion guidance We utilize CLIP-JEM as a guiding technique to transform unconditional diffusion models trained on ImageNet (Dhariwal & Nichol, 2021; Ahn et al., 2024) into text-conditioned ones using only 25 DDIM steps (Song et al., 2020). In each DDIM step t, we update the estimate of the clean image $\hat{x}_0$ using the gradients of CLIP-JEM (eq. (6)), use it to compute $x_{t-1}$, and continue the reverse DDIM process.

$$\hat{x}_0 = \hat{x}_0 + s \, \nabla_{\hat{x}_0} \text{CosineSimilarity}\left(f^I_\theta(\hat{x}_0), f^T_\theta(T)\right) \tag{6}$$
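A single guided reverse step might look like the sketch below, which uses the deterministic DDIM update (Song et al., 2020): estimate the clean image from the noise prediction, shift it along the similarity gradient as in eq. (6), and re-noise to the previous timestep. Reusing the original noise prediction after the shift, the function name, and the toy inputs are assumptions; `sim_grad_fn` is a placeholder for the gradient of CLIP-JEM's cosine similarity.

```python
import numpy as np

def guided_ddim_step(x_t, eps_pred, abar_t, abar_prev, sim_grad_fn, s):
    """One deterministic DDIM step with gradient guidance applied to the clean-image estimate.

    abar_t and abar_prev are the cumulative alpha products at steps t and t-1;
    s is the guidance scale from eq. (6).
    """
    # Clean-image estimate implied by the noise prediction.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Guidance: move the estimate along the similarity gradient.
    x0_hat = x0_hat + s * sim_grad_fn(x0_hat)
    # Re-noise to the previous timestep (assumption: the noise prediction is reused).
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_pred

rng = np.random.default_rng(3)
x0 = rng.uniform(0.0, 1.0, size=16)       # toy clean image
eps = rng.normal(size=16)                 # toy noise, treated as a perfect prediction
abar_t, abar_prev = 0.5, 0.7
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

g = rng.normal(size=16)                   # placeholder for the CLIP-JEM gradient
unguided = guided_ddim_step(x_t, eps, abar_t, abar_prev, lambda x: np.zeros_like(x), s=0.0)
guided = guided_ddim_step(x_t, eps, abar_t, abar_prev, lambda x: g, s=2.0)
print(np.abs(guided - unguided).max())
```

With the guidance scale set to zero the step reduces to plain DDIM, so the guidance term is the only difference between the conditional and unconditional trajectories.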


Figure 5: Improving CLIP-based generative frameworks via CLIP-JEM. Qualitative results of CLIP-JEM compared to CLIP using (a) CLIPDraw and (b) VQGAN+CLIP in a zero-shot setting. Panel (a) also shows CLIP's top prediction and its confidence for each drawing; its prompts include "A 3D rendering of a temple" and "Family vacation to Walt Disney World".

We provide qualitative results of text-based diffusion guidance in Figure 4, showcasing the capability of CLIP-JEM to convert unconditional diffusion models to text-based ones, thereby enhancing their utility.

Improving CLIP-Based Generative Frameworks CLIP is widely used in text-to-image generation frameworks to update the resulting image to better align with the target text in CLIP s space. This is typically done using the input gradients of the CLIP s vision encoder with respect to the text (maximizing the cosine similarity). However, these gradients tend to be semantically meaningless, as demonstrated in Figure 1. To mitigate this, a multiview augmentation pipeline is often employed to acquire semantically meaningful gradients. Our method, on the other hand, inherently produces gradients that convey rich semantic information. Consequently, using CLIP-JEM can eliminate the need for multiview augmentations, thereby improving computational efficiency. To demonstrate this, we experiment with two such frameworks: CLIPDraw (Frans et al., 2022) and VQGAN+CLIP (Crowson et al., 2022). CLIPDraw generates drawings by optimizing the parameters of Bézier curves using CLIP, and VQGAN+CLIP updates the latent code of a VQGAN to enable text-to-image generation. In both cases, we do not perform multiview augmentations and report the results with augmentations in the supplementary materials. In Figure 5a, we present the results of CLIPDraw on target texts from the original paper, along with CLIP s top prediction for each prompt. As shown, using CLIP-JEM leads to significantly better visual results, which are more aligned with the text than those generated by the vanilla CLIP. The high percentages in the top predictions imply that the images generated by CLIP have an adversarial nature, maximizing the score without performing significant modifications. Next, we evaluate the effect of using our approach to guide VQGAN. To this end, we prompted Chat GPT to provide artistic target texts, aligning with the original paper s domain. As seen in Figure 5b, removing the augmentation pipeline results in non-meaningful outputs, whereas CLIP-JEM generates semantically meaningful images. 
These results strongly attest to the improved guidance capabilities of CLIP-JEM.
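The guidance loop described above, updating an image along the input gradient of the cosine similarity without any multiview augmentations, can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the paper's implementation: the "vision encoder" is a hypothetical random linear map `W`, the "text embedding" is a random vector, and the names `encode`, `cosine_input_gradient`, and `guide` are introduced only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encode(weights, image):
    """Toy linear 'vision encoder' (assumption): one output per weight row."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def cosine_input_gradient(weights, image, text_emb):
    """Gradient of cos(encode(image), text_emb) with respect to the image."""
    u = encode(weights, image)
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in text_emb))
    cos = cosine(u, text_emb)
    # d cos / d u = v / (|u| |v|) - cos * u / |u|^2
    grad_u = [b / (norm_u * norm_v) - cos * a / (norm_u ** 2)
              for a, b in zip(u, text_emb)]
    # Chain rule through the linear encoder: grad_x = W^T grad_u
    return [sum(weights[k][j] * grad_u[k] for k in range(len(weights)))
            for j in range(len(image))]


def guide(weights, image, text_emb, steps=200, lr=0.01):
    """Plain gradient ascent on the image, with no augmentations."""
    image = list(image)
    for _ in range(steps):
        grad = cosine_input_gradient(weights, image, text_emb)
        image = [p + lr * g for p, g in zip(image, grad)]
    return image


rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
text = [rng.gauss(0.0, 1.0) for _ in range(4)]
x0 = [rng.uniform(0.0, 1.0) for _ in range(8)]
x1 = guide(W, x0, text)
before = cosine(encode(W, x0), text)
after = cosine(encode(W, x1), text)
```

In this toy setting the similarity rises from `before` to `after`. With a real CLIP encoder, the same loop is where the quality of the input gradients matters, which is the property the experiments above examine.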

4.3 CLIP-JEM as an Evaluation Metric

CLIP is often used to evaluate text-to-image generative tasks by measuring the cosine similarity between textual descriptions and images in its embedding space, a score known as CLIP-T. We compare CLIP-JEM with the standard CLIP model, focusing on the ViT-B/32 architecture commonly used for this purpose. Using TEdBench (Kawar et al., 2023), we evaluate CLIP-T scores for various inputs, including outputs from a top-performing generative model (Imagic), source images ("No Edit"), and noise images. Results are shown in table 3 under "Vanilla". We also assess robustness under adversarial attacks with a low perturbation budget ($\epsilon = 2/255$). These attacks aim to increase scores for bad images and decrease scores for good ones. Our findings indicate that CLIP is highly susceptible to adversarial attacks, resulting in higher CLIP-T scores for non-edited and noise images than for Imagic outputs. In contrast, CLIP-JEM remains robust, maintaining higher scores for Imagic outputs even under adversarial attacks.

Table 3: Robustness To Adversarial Perturbation.

| Input images | CLIP (Vanilla) | CLIP (Attack) | CLIP-JEM (Vanilla) | CLIP-JEM (Attack) |
|---|---|---|---|---|
| Imagic | 0.3031 | 0.2053 | 0.2016 | 0.1951 |
| No Edit | 0.2740 | 0.3547 | 0.1811 | 0.1866 |
| Noise | 0.2033 | 0.3037 | 0.0905 | 0.0959 |
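The robustness claim can be read directly off Table 3 by checking which input type receives the highest CLIP-T score in each setting. The short sketch below does only that. The numbers are copied from the table, while the dictionary layout and the helper name `top_ranked` are illustrative choices.

```python
# CLIP-T scores copied from Table 3, stored as (vanilla, attack) pairs.
TABLE_3 = {
    "CLIP": {
        "Imagic": (0.3031, 0.2053),
        "No Edit": (0.2740, 0.3547),
        "Noise": (0.2033, 0.3037),
    },
    "CLIP-JEM": {
        "Imagic": (0.2016, 0.1951),
        "No Edit": (0.1811, 0.1866),
        "Noise": (0.0905, 0.0959),
    },
}


def top_ranked(scores, column):
    """Name of the input type with the highest score in the given column."""
    return max(scores, key=lambda name: scores[name][column])


# Map (model, setting) to the input type that the metric prefers.
ranking = {
    (model, setting): top_ranked(scores, col)
    for model, scores in TABLE_3.items()
    for col, setting in enumerate(("Vanilla", "Attack"))
}

for key, winner in ranking.items():
    print(key, "->", winner)
```

Under attack, the standard CLIP metric ranks the unedited images first, whereas CLIP-JEM keeps the Imagic outputs on top in both settings.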

Figure 6: Sensitivity to perceptual quality.

We further analyze the effect of perceptual quality on CLIP-T scores by blending Imagic images ($x_{\text{Imagic}}$) with uniform noise: $\lambda x_{\text{Imagic}} + (1 - \lambda)u$, where $u \sim U[0, 1]$. We plot CLIP-T scores for varying $\lambda$ values in Figure 6, normalizing scores to 1.0 at $\lambda = 0$. While the standard CLIP model prefers noisier versions up to $\lambda = 0.45$, CLIP-JEM's scores decrease with increasing noise, indicating greater sensitivity to image quality. We attribute this to the contrastive energy objective, which trains the model to assign high energy to non-real images.
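A minimal sketch of the blending step is given below. It implements the blend exactly as the formula is written in the text, on a hypothetical flattened image with pixel values in [0, 1]. The CLIP-T scoring itself is not included, and the function name `blend_with_noise` is an illustrative assumption.

```python
import random


def blend_with_noise(image, lam, rng):
    """Return lam * image + (1 - lam) * u, with u ~ U[0, 1] drawn per pixel."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * p + (1.0 - lam) * rng.uniform(0.0, 1.0) for p in image]


rng = random.Random(0)
image = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical flattened pixels in [0, 1]

# Sweep the blending weight, as one would before scoring each blend.
sweep = {lam: blend_with_noise(image, lam, rng)
         for lam in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Because the blend is a convex combination, every output pixel stays in [0, 1], so the blended images can be fed to the same preprocessing as the originals.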

5 Discussion and Conclusion

In this work, we introduce CLIP-JEM, a novel approach that extends Joint Energy Models to the multimodal domain using CLIP. Through extensive evaluations, CLIP-JEM demonstrates its ability to generate high-quality, compositionally coherent images, achieving competitive results on the MS-COCO dataset and excelling in the T2I-CompBench benchmark. Moreover, CLIP-JEM showcases strong guiding capabilities, significantly improving the performance of CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Additionally, CLIP-JEM proves to be a robust and perceptually aware evaluation metric, maintaining high scores under adversarial attacks and showing greater sensitivity to image quality than the standard CLIP model. We hope that the insights and findings presented in this paper will inspire further exploration and advancements in multimodal JEM research.

Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, Seon Hwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. arXiv preprint arXiv:2403.17377, 2024.

Tsachi Blau, Roy Ganz, Chaim Baskin, Michael Elad, and Alex Bronstein. Classifier robustness enhancement via test-time transformation, 2023.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP), 2017.

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025. URL https://arxiv.org/abs/2505.09568.

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: open domain image generation and editing with natural language guidance. In ECCV 2022 - 17th European Conference, volume 13697, pp. 88-105, 2022.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780-8794, 2021.

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers, 2021.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Learning perceptually-aligned representations via adversarial robustness. arXiv preprint arXiv:1906.00945, 2019.

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023.

Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. In NeurIPS, 2022.

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.

Rinon Gal, Or Patashnik, Haggai Maron, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Trans. Graph., 41(4):141:1-141:13, 2022.

Roy Ganz and Michael Elad. BIGRoc: Boosting image generation via a robust classifier. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.

Roy Ganz and Michael Elad. Clipag: Towards generator-free text-to-image generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3843-3853, 2024.

Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, and Ron Litman. Towards models that can see and read. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 21718-21728, 2023.

Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, and Ron Litman. Question aware vision transformer for multimodal reasoning. arXiv preprint arXiv:2402.05472, 2024.

I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations, ICLR, 2015.

Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32, 2019.

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis, 2023b.

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6007-6017, 2023.

Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation, 2022. URL https://arxiv.org/abs/2110.02711.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18062-18071, June 2022.

Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.

Martha Lewis, Nihal V Nayak, Peilin Yu, Qinan Yu, Jack Merullo, Stephen H Bach, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888-12900. PMLR, 2022.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730-19742. PMLR, 2023.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URL https://arxiv.org/abs/1405.0312.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023a.

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023b.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, ICLR, 2018.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, pp. 7, 2011.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, ICML 2022, volume 162, pp. 16784-16804, 2022.

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, pp. 2065-2074, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748-8763, 2021.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022. doi: 10.48550/arXiv.2204.06125. URL https://doi.org/10.48550/arXiv.2204.06125.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684-10695, June 2022a.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022b.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022c.

Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. Do adversarially robust imagenet models transfer better? Advances in Neural Information Processing Systems, 33:3533-3545, 2020.

Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image Synthesis with a Single (Robust) Classifier. 2019.

Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis, 2023.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, ICLR, 2014.

D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations, ICLR, 2019.

Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: semantically-aware object sketching. ACM Trans. Graph., 41(4):86:1-86:11, 2022.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681-688. Citeseer, 2011.

Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neural visual world creation, 2021. URL https://arxiv.org/abs/2111.12417.

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks, 2017.

Xiulong Yang and Shihao Ji. Jem++: Improved techniques for training jem. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6494-6503, 2021.

Xiulong Yang, Qing Su, and Shihao Ji. Towards bridging the performance gaps of joint energy-based models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15732-15741, 2023.

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2022.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks, 2017.

Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Lafite: Towards language-free training for text-to-image generation, 2022.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Yao Zhu, Jiacheng Ma, Jiacheng Sun, Zewei Chen, Rongxin Jiang, Yaowu Chen, and Zhenguo Li. Towards understanding the generative capability of adversarially robust classifiers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7728-7737, 2021.

A Detailed background on energy-based models

Energy based models Energy-based models (EBMs) (LeCun et al., 2006) are a class of probabilistic models that define a probability distribution over the data points via an energy function. Specifically, EBMs utilize the fact that any probability density $p(x)$ for $x \in \mathbb{R}^D$ can be written as

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \tag{7}$$

where $E_\theta(\cdot)$ is the energy function, which assigns scalar values to data points ($E_\theta : \mathbb{R}^D \to \mathbb{R}$), and $Z(\theta)$ is a normalizing constant. In EBMs, the probability of a given data point is determined by its energy, where lower energy indicates a higher probability, as can be seen in Equation (7).

Training such a model involves learning the parameters $\theta$ of the energy function, typically modeled by a neural network, to minimize the energy of observed positive data points, $x^+ \sim p(x)$, while maximizing the energy of negative samples, $x^- \sim p_\theta(x)$, generated by the model during training. The training converges when the model cannot detect whether a sample is "positive" or "negative", attesting that $p_\theta$ and $p$ have a similar distribution.

EBMs require sampling from $p_\theta$, both during training to acquire negative samples and during inference to synthesize new images. In practice, drawing such samples is done via Stochastic Gradient Langevin Dynamics (SGLD), initialized with $x_0 \sim p_0(x)$ from a canonical distribution (e.g., uniform over the input domain), and updated iteratively by

$$x_{i+1} = x_i - \frac{\alpha}{2} \frac{\partial E_\theta(x_i)}{\partial x_i} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \alpha I), \tag{8}$$

where $\alpha$ is the sampler's step size.
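As an illustration of the SGLD update in Equation (8), the following sketch samples from a one-dimensional toy energy. The quadratic energy $E(x) = x^2/2$ (whose density is a standard normal), the step size, the chain length, and the burn-in are all assumptions chosen for the example. In an actual EBM the gradient would come from a neural network.

```python
import math
import random


def energy_grad(x):
    """Gradient of the toy energy E(x) = x^2 / 2 (a standard normal density)."""
    return x


def sgld(num_steps, alpha, rng):
    """Iterate x <- x - (alpha / 2) * dE/dx + eps, with eps ~ N(0, alpha)."""
    x = rng.uniform(-1.0, 1.0)  # x_0 drawn from a uniform initial distribution
    chain = []
    for _ in range(num_steps):
        noise = rng.gauss(0.0, math.sqrt(alpha))  # std is sqrt of the variance
        x = x - 0.5 * alpha * energy_grad(x) + noise
        chain.append(x)
    return chain


chain = sgld(num_steps=20000, alpha=0.1, rng=random.Random(0))
samples = chain[1000:]  # discard early iterations as burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this toy energy the empirical mean and variance of the chain should be close to 0 and 1, up to discretization and Monte Carlo error. Dropping the noise term would instead make the chain collapse onto the energy minimum at zero.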

Joint energy models Recently, Grathwohl et al. (2019) observed that one can utilize a classifier to model an energy function and train it accordingly. Given a classifier $f_\theta$ which maps inputs into $K$ values, known as logits ($f_\theta : \mathbb{R}^D \to \mathbb{R}^K$), one can parameterize a conditional distribution using the Softmax function:

$$p_\theta(y \mid x) = \frac{\exp(f_\theta(x)_y)}{\sum_{y'} \exp(f_\theta(x)_{y'})}, \tag{9}$$

where $f_\theta(x)_y$ is the logit corresponding with the $y$th class label. The key observation is that one can construct expressions for $p_\theta(x, y)$ and $p_\theta(x)$ using the classifier's logits. The joint distribution of a data point $x$ and a label $y$ is given via

$$p_\theta(x, y) = \frac{\exp(f_\theta(x)_y)}{Z(\theta)}, \tag{10}$$

where $Z(\theta) = \int_x \exp(f_\theta(x)_y)\,dx$ is an intractable normalizing factor. Accordingly, we can define a joint energy function $E_\theta(x, y) = -f_\theta(x)_y$. Using marginalization over $y$, we can obtain the unconditional distribution,

$$p_\theta(x) = \sum_y p_\theta(x, y) = \frac{\sum_y \exp(f_\theta(x)_y)}{Z(\theta)}. \tag{11}$$

Thus, in the unconditional case, the energy function is given as $E_\theta(x) = -\log \sum_y \exp(f_\theta(x)_y)$.
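The relation between logits and energies can be checked numerically. The sketch below uses the sign convention stated above, in which lower energy means higher probability, so the joint energy is the negated logit and the marginal energy is the negated log-sum-exp of the logits. The logit values and function names are hypothetical and serve only as an illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def joint_energy(logits, y):
    """E(x, y) = -f(x)_y: the negated logit of class y."""
    return -logits[y]


def marginal_energy(logits):
    """E(x) = -log sum_y exp(f(x)_y)."""
    return -logsumexp(logits)


def class_posterior(logits):
    """p(y | x) = exp(E(x) - E(x, y)), which equals the softmax of the logits."""
    e_x = marginal_energy(logits)
    return [math.exp(e_x - joint_energy(logits, y))
            for y in range(len(logits))]


logits = [2.0, -1.0, 0.5]  # hypothetical classifier outputs f_theta(x)
posterior = class_posterior(logits)
```

The intractable normalizer $Z(\theta)$ cancels in the ratio $p_\theta(x, y) / p_\theta(x)$, which is why the class posterior can be computed exactly from the two energies even though neither density can be evaluated on its own.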

With this interpretation, one can train a joint model for both discriminative and generative modeling, optimizing both classification and EBM objectives. Such models lead to good classification capabilities, showcasing impressive adversarial robustness while being able to generate new data samples. Nevertheless, training such models often suffers from instability and even divergence, making it applicable mainly for small datasets but not for real-world ones.

Arch. BS disc. BS gen. #Steps LR WD Sched. Warmup Adv. ϵ T_adv T_JEM γ α1 α2

ViT-B/32 256 32

1×10^-4 Cosine 200 3.0 5 50 0.1 1.5 0.025

ConvNext-B 128 32 2×10^-5

ConvNext-L 128 16 2×10^-5

ConvNext-XXL 32 8 2×10^-6

Table 4: Implementation details. We provide the training hyperparameters of CLIP-JEM for the different architectures (BS disc. and BS gen. stand for the discriminative and generative batch sizes, respectively).

| Arch. | Time [Sec.] | Memory [M] |
|---|---|---|
| ViT-B/32 | 1.4 | 2787 |
| ConvNext-B | 2.6 | 3009 |
| ConvNext-L | 2.4 | 4523 |
| ConvNext-XXL | 4.5 | 11837 |

Table 5: Sampling time and memory. The time per-sample in seconds and memory consumption of the different considered models.

B Implementation details

Training hyperparameters We implement our method upon the OpenClip codebase³. We consider the following model architectures from the model zoo: convnext_xxlarge, convnext_large_d, convnext_base_w, and ViT-B-32, with the following pretrained weights, respectively: laion2b_s34b_b82k_augreg_soup, laion2b_s26b_b102k_augreg, laion2b_s13b_b82k_augreg, and openai. In table 4, we report the training hyperparameters for CLIP-JEM. We use these hyperparameters to train the different architectures on DataComp. We train our models on 8 A40 GPUs. Training the largest variant (ConvNext-XXL) for 20K iterations takes 10 days. During training, in the generation process of the negative samples, we apply a momentum of 0.9 in the AdamW optimizer. However, throughout the inference phase, we do not utilize momentum at all. We aim to make our code and pretrained models publicly available upon acceptance.

Experimental settings To measure the quality and fidelity of the images generated by our method, we compare it to strong baselines on the MS-COCO dataset (table 1). Specifically, we compare CLIP-JEM to text-to-image generative models of different types: (i) GAN-based: Stack-GAN (Zhang et al., 2017), AttnGAN (Xu et al., 2017), LAFITE (Zhou et al., 2022), StyleGAN-T (Sauer et al., 2023), and GigaGAN (Kang et al., 2023b); (ii) diffusion-based: GLIDE (Nichol et al., 2022) and LDM (Rombach et al., 2022c); and (iii) autoregressive: DALL-E (Ramesh et al., 2021), CogView (Ding et al., 2021), and NÜWA (Wu et al., 2021). Our reported results and model sizes originate from the respective papers. The CLIPAG baseline results were obtained by us, by removing the contrastive energy loss term while using the same architecture and hyperparameters. As for the CompBench results (table 2), we report the ones from the benchmark's paper.

As for the guidance experiments, our baseline for diffusion guidance is an ADM-based model (Dhariwal & Nichol, 2021) with perturbed-attention guidance (Ahn et al., 2024), which strongly improves unconditional generation.

Sampling time and memory We train 4 different architectures of CLIP-JEM. Since model size and structure affect the runtime complexity, we report in table 5 the time of our generation process with a batch size of 1 on an Nvidia A40 GPU. As expected, increasing the model size leads to more memory consumption and increases the time per sample. However, the ConvNext-B generation time is slightly larger than that of ConvNext-L. This is because we utilize the wide variant of ConvNext-B, which is not available for ConvNext-L.

³ https://github.com/mlfoundations/open_clip

Algorithm 1 CLIP-JEM Training. Given CLIP image and text encoders $f^I_\theta(\cdot)$ and $f^T_\theta(\cdot)$, image-text dataset $D$, adversarial budget $\epsilon$, adversarial and energy step-sizes $\alpha_1, \alpha_2$, energy loss coefficient $\gamma$, and number of adversarial and generation iterations $T_{adv}, T_{JEM}$:

while not converged do
    Sample $(I, T)$ from dataset $D$
    /* Contrastive adversarial loss */
    $\delta_0 \leftarrow 0$
    for $t$ from 0 to $T_{adv}$ do
        $\delta_{t+1} = \Pi_\epsilon(\delta_t + \alpha_1 \nabla_{\delta_t} \mathrm{ClipLoss}(f^I_\theta(I + \delta_t), f^T_\theta(T)))$
    end
    $I_{adv} = I + \delta_{T_{adv}}$
    $L_{adv} = \mathrm{ClipLoss}(f^I_\theta(I_{adv}), f^T_\theta(T))$
    /* Contrastive energy loss */
    Sample initial negative sample $\tilde{I}$
    Optimizer $\leftarrow$ AdamW(params = $\tilde{I}$, lr = $\alpha_2$)
    for $t$ from 0 to $T_{JEM}$ do
        $L_{JEM} = \mathrm{ClipLoss}(f^I_\theta(\tilde{I} + \beta n), f^T_\theta(T))$ /* $n \sim \mathcal{N}(0, I)$, $\beta$ is a small scalar */
        Calculate $\partial L_{JEM} / \partial \tilde{I}$ and perform an optimizer step
    end
    $L_{JEM} = \mathrm{ClipLoss}(f^I_\theta(\mathrm{Concat}(I, \tilde{I})), f^T_\theta(T))$
    /* Update the vision encoder */
    $L = L_{adv} + \gamma \cdot L_{JEM}$
    Calculate $\partial L / \partial \theta$ and update CLIP image encoder $f^I_\theta(\cdot)$
end

C Training protocol

In algorithm 1, we detail the overall training procedure of CLIP-JEM. First, we draw image-text batches from dataset $D$ and apply $T_{adv}$ iterations to obtain the adversarial visual inputs $I_{adv}$, which we use to calculate the adversarial loss $L_{adv}$. Next, we perform an iterative pixel-space optimization to obtain negative samples $\tilde{I}$. Specifically, this is done with $T_{JEM}$ iterations of the AdamW optimizer. We form our joint energy-based loss using the positive and negative samples. Lastly, we combine these two terms to formulate our overall objective, which we use to update the vision encoder weights. We repeat this process until convergence.
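The inner adversarial loop of the procedure above (a gradient step followed by a projection onto the budget set) can be sketched in isolation. The sketch makes several assumptions that the text does not fix: the projection is onto an L2 ball, the raw gradient is used without normalization, and the contrastive loss is replaced by a toy linear function whose gradient is a constant vector `w`. The numeric values are illustrative only.

```python
import math


def project_l2(delta, eps):
    """Project delta onto the L2 ball of radius eps (the norm is an assumption)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]


def adversarial_perturbation(grad_fn, dim, eps, step_size, num_steps):
    """Iterate delta <- project(delta + step_size * grad_fn(delta))."""
    delta = [0.0] * dim  # delta_0 = 0
    for _ in range(num_steps):
        grad = grad_fn(delta)
        stepped = [d + step_size * g for d, g in zip(delta, grad)]
        delta = project_l2(stepped, eps)
    return delta


# Toy stand-in for the contrastive loss: loss(delta) = w . delta, so grad = w.
w = [3.0, 4.0]
delta = adversarial_perturbation(lambda d: w, dim=2, eps=3.0,
                                 step_size=1.5, num_steps=5)
delta_norm = math.sqrt(sum(d * d for d in delta))
```

For this linear toy loss the perturbation saturates on the boundary of the ball, pointing along `w`. In the actual training loop the gradient would be recomputed through the image encoder at every iteration.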

D Ablation study

Architecture In table 1, we report the results of both ViT-B-32 and ConvNext base using CLIP-JEM. Notably, although the two architectures have a similar capacity, the ConvNext FID score is significantly better (by 33.5 points). We hypothesize that this stems from the stronger image prior that CNN-based architectures provide. Additionally, images generated with the ViT contain grid artifacts from the patch processing mechanism. Thus, we mainly focus on ConvNext-based models.

Objectives contribution To highlight the importance of combining the adversarial and energy contrastive losses, we train the same model for 1,000 iterations using (i) contrastive energy loss; (ii) contrastive adversarial loss; and (iii) contrastive energy loss + contrastive adversarial loss. We plot the results of 16 text prompts from MS-COCO for the resulting models in fig. 7. The results indicate that using only the contrastive energy loss does not produce meaningful outputs. In contrast, employing solely the contrastive adversarial loss results in meaningful but unrealistic content. Remarkably, combining the two objectives (CLIP-JEM's approach) leverages the strengths of both CLIPAG and JEMs, yielding superior outcomes.

[Figure 7 panels: Contrastive Energy Loss; Contrastive Adversarial Loss; Contrastive Energy Loss + Contrastive Adversarial Loss]

Figure 7: Objectives ablation study. CLIP-JEM employs two objectives: a Contrastive Adversarial Loss and a Contrastive Energy Loss. In this figure, we ablate the effect of each objective, highlighting its importance. We plot 16 images based on different textual prompts from MS-COCO after 500 training steps.

E Connection to Stochastic Gradient Langevin Dynamics

SGLD enables sampling from a distribution $p(x)$ given its score function $\nabla_x \log p(x)$. The stochasticity in this algorithm is important, as applying a simple gradient method would result in sampling the distribution peaks rather than truly sampling the distribution. The iterative update of SGLD in its discrete form is as follows:

$$x_{t+1} = x_t + \frac{\alpha}{2} \nabla_{x_t} \log p(x_t) + \epsilon,$$

where $\alpha$ is the step size and $\epsilon \sim \mathcal{N}(0, \alpha I)$ is Gaussian noise. From eq. (7), the above equation can also be written in terms of energy,

$$x_{t+1} = x_t - \frac{\alpha}{2} \nabla_{x_t} E_\theta(x_t) + \epsilon,$$

and in the joint energy case, the update rule becomes

$$x_{t+1} = x_t - \frac{\alpha}{2} \nabla_{x_t} E_\theta(x_t, y) + \epsilon.$$

Note that in our algorithm we add a small Gaussian perturbation to the input, prior to the calculation of the gradient (algorithm 1). This introduces a similar, yet different, stochasticity in our approach, compared with SGLD. To better understand this, we expand the following Taylor series:

$$\nabla_x E_\theta(x + \epsilon, y) = \nabla_x E_\theta(x, y) + \nabla_x^2 E_\theta(x, y)^T \epsilon + O(\|\epsilon\|_2^2).$$

As our noise is of small variance, the higher-order term can be neglected,

$$\nabla_x E_\theta(x + \epsilon, y) \approx \nabla_x E_\theta(x, y) + \nabla_x^2 E_\theta(x, y)^T \epsilon,$$

making it similar to the SGLD noisy update step, as the term $\nabla_x^2 E_\theta(x, y)^T \epsilon$ is a linear transformation of the Gaussian noise, resulting in a colored version of a Gaussian distribution, governed by the Hessian of the energy function.
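To make the "colored Gaussian" remark concrete, the short derivation below computes the distribution of the linear term. It is a sketch under the assumption that the injected perturbation is $\epsilon = \beta n$ with $n \sim \mathcal{N}(0, I)$, as in algorithm 1, and it writes $H$ as shorthand (introduced here) for the Hessian of the energy.

```latex
\begin{aligned}
H &:= \nabla_x^2 E_\theta(x, y), \qquad \epsilon = \beta n, \quad n \sim \mathcal{N}(0, I), \\
\mathbb{E}\big[H^{T}\epsilon\big] &= H^{T}\,\mathbb{E}[\epsilon] = 0, \\
\operatorname{Cov}\big[H^{T}\epsilon\big] &= H^{T}\,\mathbb{E}\big[\epsilon\epsilon^{T}\big]\,H = \beta^{2} H^{T} H, \\
H^{T}\epsilon &\sim \mathcal{N}\big(0,\ \beta^{2} H^{T} H\big).
\end{aligned}
```

In contrast to the isotropic covariance $\alpha I$ of the SGLD noise, the effective noise here is shaped by the local curvature of the energy: directions of high curvature receive more noise, while flat directions receive almost none.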

We demonstrate the randomness introduced by this mechanism using our sampling process in fig. 8. Unlike SGLD, we utilize an AdamW optimizer for the update step of the generated sample. Thus, we do not use the raw gradient as in SGLD, since our optimizer has an adaptive learning rate mechanism. This results in a different effective step size per iteration ($\alpha$ in SGLD). AdamW also enables a momentum term, which introduces bias to the gradients. During CLIP-JEM training, we utilize the momentum as it significantly stabilizes the training. However, at inference time, we do not employ momentum, resulting in an unbiased version of the gradients.

[Figure 8 panels, one per prompt: "A cozy cabin in the woods with smoke coming from the chimney"; "A group of astronauts exploring a distant planet"; "A medieval knight standing in front of a castle"]

Figure 8: Stochasticity demonstration of CLIP-JEM sampling.

F Additional results

We provide additional qualitative results for both improving CLIP-based generative frameworks and text-to-image generation. For the former, we provide the results using the same prompts with augmentations (figs. 10 and 11). As can be seen, seamlessly replacing CLIP with CLIP-JEM leads to improved results with and without augmentations. Notably, CLIP-JEM results without augmentations are comparable to those of CLIP with augmentations, offering a substantial reduction in computational overhead. For the latter, we provide more generated images (fig. 9) and demonstrate the stochasticity of CLIP-JEM (fig. 8).

G Positioning CLIP-JEM within the Landscape of Generative Models

Comparison with Diffusion Models While CLIP-JEM does not achieve state-of-the-art FID scores in text-to-image generation (see table 1), it offers several notable advantages. It exhibits superior compositional generalization, outperforming Stable Diffusion v1.4 and v2 on the T2I-CompBench benchmark (Huang et al., 2023), as well as surpassing compositionality-oriented diffusion-based methods. This performance stems from its joint energy contrastive adversarial training, which enhances the alignment between textual prompts and generated images. Unlike diffusion pipelines that rely on multi-step denoising in pixel or latent spaces, which necessitates a learned decoder and intricate noise scheduling, CLIP-JEM samples directly in pixel space through a straightforward iterative gradient-following process. This approach yields energy values that are monotonically related to sample likelihoods (lower energy indicates higher likelihood), facilitating explicit probabilistic ranking and selection of outputs. We leverage this property to develop a robust text-to-image alignment metric (see section 4.3) and to convert unconditional diffusion models into text-conditioned generators. Furthermore, CLIP-JEM's energy-based formulation allows for seamless "plug-and-play" integration into CLIP-based synthesis workflows; for example, replacing the CLIP encoder in CLIPDraw with CLIP-JEM results in more semantically faithful renderings under the same optimization loop (Frans et al., 2022).

Comparison with Architectures Employing CLIP Image Encoders Leading text-to-image systems incorporate CLIP's image encoder alongside distinct generative backbones rather than utilizing it to generate pixels directly. DALL-E 2 (Ramesh et al., 2022) employs a diffusion prior to map CLIP text embeddings to CLIP image embeddings, followed by a modified GLIDE decoder for pixel synthesis. DiffusionCLIP (Kim et al., 2022) integrates a CLIP-based contrastive loss into the reverse diffusion trajectory of pretrained diffusion models to enable zero-shot, text-guided image manipulation. VQGAN-CLIP (Crowson et al., 2022) pairs a pretrained VQGAN generator with CLIP's image encoder to score and steer pixel-space updates via gradient descent. More recently, BLIP3-o (Chen et al., 2025) utilizes the CLIP image encoder in conjunction with a large language model and a diffusion transformer to generate images. In contrast, CLIP-JEM constructs a generative model directly from CLIP itself, eliminating the need for an additional generative model.

H Limitations and Future Work

Our current work exhibits two main limitations. First, CLIP-JEM shows limited diversity in its generated samples. The primary source of stochasticity is the random initialization process, and while we experimented with noise injection during optimization steps, we maintained a small noise coefficient for stability. As a result, the images generated for the same prompt share similar visual characteristics, as evident in fig. 8. Future work could explore enhancing sample diversity through increased noise levels during optimization or alternative initialization strategies. Second, CLIP-JEM underperforms on tasks involving spatial relationships, as demonstrated in table 2. This limitation stems from CLIP's inherent constraints in spatial compositionality, as our model is initialized from CLIP weights, and it aligns with the well-documented limitations of CLIP in handling spatial relationships. Future research could focus on improving these spatial understanding capabilities, potentially through architectural modifications or specialized training objectives that better capture spatial information.

[Figure 9: grid of generated images with their prompts; legible prompts include "A bear in the woods standing on a log.", "a bed with some kind of blankets on it", "A vintage train traveling through a countryside.", "A small marina with boats docked there", and "A ballet dancer performing on a grand stage".]

Figure 9: Additional qualitative results. Samples generated by CLIP-JEM using ConvNext-XXL.

[Figure 10: CLIPDraw drawings with CLIP's top prediction and its confidence shown for each image; legible prompts include "A drawing of a cat" and "Family vacation to Walt Disney World"; panel labels include CLIP, CLIP-JEM, and Aug+CLIP-JEM.]

Figure 10: CLIPDraw results. We provide the results of CLIP and CLIP-JEM with and without augmentations.

[Figure 11: grid of VQGAN+CLIP outputs with their prompts; legible prompts include "A sculpture inspired by the works of Constantin Brâncuși" and "A digital illustration of a futuristic spaceship".]

Figure 11: VQGAN+CLIP results. We provide the results of CLIP and CLIP-JEM with and without augmentations.