# Visual Generation Without Guidance

Huayu Chen*¹, Kai Jiang*¹², Kaiwen Zheng¹, Jianfei Chen¹, Hang Su¹, Jun Zhu¹²

**Abstract.** Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement: it retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across the domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. Code: https://github.com/thu-ml/GFT.

## 1. Introduction

Low-temperature sampling is a critical technique for enhancing generation quality by focusing only on the model's high-likelihood areas. Visual models mainly achieve this via Classifier-Free Guidance (CFG) (Ho & Salimans, 2022). As illustrated in Fig. 1 (left), CFG jointly optimizes the target conditional model and an extra unconditional model during training, and combines them to define the sampling process.

*Equal contribution. ¹Department of Computer Science & Technology, Tsinghua University. ²ShengShu, Beijing, China. Correspondence to: Jun Zhu.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: Comparison of GFT and CFG. GFT shares CFG's training objective but uses a different parameterization of the conditional model, which enables direct training of an explicit sampling model. By altering the guidance scale s, it can flexibly trade off image fidelity and diversity while significantly improving sample quality.

Due to its effectiveness, CFG has been adopted as a default technique for a wide spectrum of visual generative models, including diffusion (Ho et al., 2020), autoregressive (AR) (Chen et al., 2020; Tian et al., 2024), and masked-prediction models (Chang et al., 2022; Li et al., 2023).

However, CFG is not without problems. First, the reliance on an extra unconditional model doubles the sampling cost compared with vanilla conditional generation. Second, this reliance complicates the post-training of visual models: when distilling pretrained diffusion models (Meng et al., 2023; Luo et al., 2023; Yin et al., 2024) for fast inference, or when applying RLHF techniques (Black et al., 2023; Chen et al., 2024b), the extra unconditional model must be specially accounted for in the algorithm design. Third, this also marks a sharp difference from low-temperature sampling in language models (LMs), where a single model suffices to represent the sampling distributions across various temperatures. Naively following the LM approach of dividing model outputs by a constant temperature value is generally ineffective in visual sampling (Dhariwal & Nichol, 2021), even for visual AR models with architectures similar to LMs (Sun et al., 2024).

Figure 2: Impact of adjusting the GFT sampling temperature β for guidance-free DiT-XL/2. GFT achieves similar results to CFG without requiring dual-model inference at each step. More examples are in Figure 13.

All these lead us to ask: can we effectively control the sampling temperature for visual models using one single model?

Existing attempts, such as distillation methods for diffusion models (Meng et al., 2023; Luo et al., 2023; Yin et al., 2024) and alignment methods for AR models (Chen et al., 2024b), are not ultimate solutions. They all rely heavily on pretrained CFG networks for their loss definitions and do not support training guidance-free models from scratch. Their two-stage optimization pipeline may also lead to performance loss compared with CFG, even after extensive tuning. Generalizability is also a concern: current methods are typically tailored to either continuous diffusion models or discrete AR models, lacking the versatility to cover all domains.

We propose Guidance-Free Training (GFT), a foundational algorithm for building visual generative models with no guidance. GFT matches CFG in performance while requiring only a single model for temperature-controlled sampling, effectively halving sampling costs compared with CFG. GFT offers stable and efficient training with the same convergence rate as CFG, almost no extra memory usage, and only 10-20% additional computation per training update. GFT is highly versatile, applicable in all visual domains within CFG's scope, including diffusion, AR, and masked models.

The core idea behind GFT is to transform the desired sampling model into easily learnable forms. GFT optimizes the same conditional objective as CFG. However, instead of aiming to learn an explicit conditional network, GFT defines the conditional model implicitly as the linear interpolation of a sampling network and the unconditional network (Figure 1). By training this implicit model, GFT directly optimizes the underlying sampling network, which is then employed for visual generation without guidance.
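To make the implicit parameterization concrete, below is a minimal NumPy sketch, not the authors' implementation: toy linear functions stand in for the real sampling and unconditional denoising networks, `beta` plays the role of the interpolation weight (temperature), and all names (`eps_sample`, `eps_uncond`, `gft_loss`) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-ins for the real denoising networks.
def eps_sample(x_t, c, beta):
    """Guidance-free sampling network; takes the temperature beta as an input."""
    return 0.9 * x_t - 0.1 * c + 0.01 * beta

def eps_uncond(x_t):
    """Unconditional network."""
    return 0.8 * x_t

def implicit_conditional(x_t, c, beta):
    """GFT's conditional model: defined (not learned directly) as a linear
    interpolation of the sampling network and the unconditional network."""
    return beta * eps_sample(x_t, c, beta) + (1.0 - beta) * eps_uncond(x_t)

def gft_loss(x, c, beta, alpha_t=0.7, sigma_t=0.714):
    """Same denoising MSE objective as CFG, routed through the implicit model."""
    noise = rng.standard_normal(x.shape)
    x_t = alpha_t * x + sigma_t * noise  # forward noising step
    return np.mean((implicit_conditional(x_t, c, beta) - noise) ** 2)

x = rng.standard_normal((4, 8))  # a batch of toy "images"
c = rng.standard_normal((4, 8))  # toy conditioning vectors
loss = gft_loss(x, c, beta=0.5)
```

Note that with `beta = 1.0` the implicit conditional model coincides with the sampling network itself, and with `beta = 0.0` it reduces to the unconditional network; gradients of this loss flow directly into the sampling network, which is the single model kept for guidance-free generation.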
In essence, one can consider GFT simply as a conditional parameterization technique in CFG training. This perspective makes GFT extremely easy to implement on top of existing codebases, requiring only a few lines of modification, with most design choices and hyperparameters inherited.

We verify the effectiveness and efficiency of GFT on both class-to-image and text-to-image tasks, spanning five distinct types of visual models: DiT (Peebles & Xie, 2023), VAR (Tian et al., 2024), LlamaGen (Sun et al., 2024), MAR (Li et al., 2024), and LDM (Rombach et al., 2022). Across all models, GFT enjoys almost lossless FID when fine-tuning existing CFG models into guidance-free models (Sec. 5.2). For instance, we achieve a guidance-free FID of 1.99 for the DiT-XL model with only 2% of the pretraining epochs, while the CFG performance is 2.11. This surpasses previous distillation and alignment methods in their respective application domains. GFT also demonstrates great superiority in building guidance-free models from scratch. With the same number of training epochs, GFT models generally match or even outperform CFG models, despite being 50% cheaper in sampling (Sec. 5.3). By taking a temperature parameter as model input, GFT can achieve a flexible diversity-fidelity trade-off similar to CFG (Sec. 5.4).

## 2. Background

### 2.1. Visual Generative Modeling

**Continuous diffusion models.** Diffusion models (Ho et al., 2020) define a forward process that gradually injects noise into clean images drawn from the data distribution $p(x)$:

$$x_t = \alpha_t x + \sigma_t \epsilon,$$

where $t \in [0, 1]$, $\epsilon$ is standard Gaussian noise, and $(\alpha_t, \sigma_t)$ defines the noise schedule. We have

$$p_t(x_t) = \int \mathcal{N}(x_t \mid \alpha_t x, \sigma_t^2 I)\, p(x)\, \mathrm{d}x,$$

where $p_0(x) = p(x)$ and $p_1 \approx \mathcal{N}(0, I)$. Given data following $p(x, c)$, we can train conditional diffusion models by predicting the Gaussian noise added to $x_t$:

$$\min_\theta \; \mathbb{E}_{p(x,c),\, t,\, \epsilon} \left[ \big\| \epsilon_\theta(x_t \mid c) - \epsilon \big\|_2^2 \right]. \tag{1}$$

More formally, Song et al. (2021) proved that Eq.
(1) is essentially performing maximum likelihood training with an evidence lower bound (ELBO). Also, the denoising model $\epsilon_\theta^*$ eventually converges to the data score function:

$$\epsilon_\theta^*(x_t \mid c) = -\sigma_t \nabla_{x_t} \log p_t(x_t \mid c). \tag{2}$$

Given a condition $c$, $\epsilon_\theta$ can be leveraged to generate images from $p_\theta(x \mid c)$ by iteratively denoising noise drawn from $p_1$.

**Discrete AR & masked models.** AR models (Chen et al., 2020) and masked-prediction models (Chang et al., 2022) function similarly. Both discretize images $x$ into token sequences $x^{1:N}$ and then perform token prediction. Their maximum likelihood training objective can be unified as

$$\min_\theta \; \mathbb{E}_{p(x^{1:N},\, c)} \sum_n -\log p_\theta(x^n \mid x^{<n}, c). \tag{3}$$

### 2.2. Classifier-Free Guidance

**Continuous CFG.** CFG (Ho & Salimans, 2022) jointly trains a conditional denoising network $\epsilon_\theta(x_t \mid c)$ and an unconditional one $\epsilon_\theta(x_t)$, and samples from their linear combination

$$\epsilon_\theta^s(x_t \mid c) = \epsilon_\theta(x_t \mid c) + s\left(\epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t)\right)$$

with guidance scale $s > 0$, thereby substantially improving sample quality.

**Discrete CFG.** Besides diffusion, CFG is also a critical sampling technique in discrete visual modeling (Li et al., 2023; Team, 2024; Tian et al., 2024; Xie et al., 2024), though the guidance operation is performed in logit space instead of on the score field:

$$\ell_\theta^s(x^n \mid x^{<n}, c) = \ell_\theta(x^n \mid x^{<n}, c) + s\left(\ell_\theta(x^n \mid x^{<n}, c) - \ell_\theta(x^n \mid x^{<n})\right).$$
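To illustrate the effect of combining conditional and unconditional outputs with a guidance scale, here is a small NumPy sketch with toy logit values of our own choosing (not from the paper): increasing $s$ sharpens the conditional token distribution, mimicking low-temperature sampling.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guided_logits(l_cond, l_uncond, s):
    # Logit-space guidance: push conditional logits away from unconditional ones.
    return l_cond + s * (l_cond - l_uncond)

l_cond = np.array([2.0, 1.0, 0.5, 0.0])    # toy conditional logits
l_uncond = np.array([1.0, 1.0, 1.0, 1.0])  # toy unconditional logits

p_plain = softmax(guided_logits(l_cond, l_uncond, s=0.0))   # s = 0: plain conditional
p_guided = softmax(guided_logits(l_cond, l_uncond, s=3.0))  # s > 0: guided sampling

# The guided distribution concentrates more probability mass on the top token.
```

With $s = 0$ the rule reduces to ordinary conditional sampling; larger $s$ trades diversity for fidelity, which is the diversity-fidelity trade-off that both CFG and GFT expose through their scale/temperature parameter.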