# Long Horizon Temperature Scaling

Andy Shih 1, Dorsa Sadigh 1, Stefano Ermon 1

## Abstract

Temperature scaling is a popular technique for tuning the sharpness of a model distribution. It is used extensively for sampling likely generations and calibrating model uncertainty, and even features as a controllable parameter to many large language models in deployment. However, autoregressive models rely on myopic temperature scaling that greedily optimizes the next token. To address this, we propose Long Horizon Temperature Scaling (LHTS), a novel approach for sampling from temperature-scaled joint distributions. LHTS is compatible with all likelihood-based models, and optimizes for the long horizon likelihood of samples. We derive a temperature-dependent LHTS objective, and show that finetuning a model on a range of temperatures produces a single model capable of generation with a controllable long horizon temperature parameter. We experiment with LHTS on image diffusion models and character/language autoregressive models, demonstrating advantages over myopic temperature scaling in likelihood and sample quality, and showing a 10% improvement in accuracy on a multiple-choice analogy task. Our code is available at https://github.com/AndyShih12/LongHorizonTemperatureScaling.

1 Department of Computer Science, Stanford University. Correspondence to: Andy Shih.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

Temperature scaling is a simple yet effective technique for rescaling model outputs: lowering the temperature to increase the probability of high-likelihood outcomes, or vice versa. In discriminative settings, tuning the temperature has shown success as a calibration method (Guo et al., 2017; Nixon et al., 2019; Desai & Durrett, 2020). The model outputs a small set of class probabilities, which can be tractably rescaled to match the desired calibration metric.

In generative tasks, temperature scaling also serves as a method for controlling the randomness of model outputs, and has been shown to be useful for many natural language generation tasks such as summarization and question answering (Liang et al., 2022). Many current models in deployment (Brown et al., 2020; Bommasani et al., 2021) even expose the model temperature as a user-controllable parameter in their API. These autoregressive language models execute temperature scaling one token at a time, rescaling the probability of the next token to be proportional to $\exp\!\big(\log p(x_i \mid x_{1:i-1}) / T\big)$.

[...]

$$\log p(x_0) \;\ge\; \mathbb{E}\Big[\,\cdots \;-\; \sum_k D_{\mathrm{KL}}\big(h(x_{k-1} \mid x_k, x_0)\,\big\|\,p(x_{k-1} \mid x_k)\big)\Big] \tag{8}$$

We can then plug this likelihood lower bound into LHTS to compute the importance weights for each data point, and finetune $q_T$ with Eq. 7, where the inner likelihood is again evaluated with the lower bound in Eq. 8.

**Diffusion Models.** Although diffusion models can also be formulated as hierarchical latent variable models, they are often trained using a simpler MSE loss on the noise (Ho et al., 2020). Nevertheless, LHTS is still directly applicable by scaling the loss for each point by its importance weight:

$$\mathcal{L}(q_T) \;=\; \mathbb{E}_{k,\, x_0,\, \epsilon}\Big[\, w_T(x_0)\, \big\|\epsilon - \epsilon_{q_T}\big(\sqrt{\bar{\alpha}_k}\, x_0 + \sqrt{1 - \bar{\alpha}_k}\,\epsilon,\; k\big)\big\|^2 \Big] \tag{9}$$
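To make Eq. 9 concrete, the following is a minimal PyTorch sketch of one training step under the importance-weighted diffusion objective. It assumes the per-example weights $w_T(x_0)$ have already been computed (e.g., from the likelihood lower bound in Eq. 8); the function name and the noise-prediction interface `eps_model(x_k, k)` are illustrative placeholders rather than the paper's released code.

```python
import torch

def lhts_diffusion_loss(eps_model, x0, w_T, alpha_bar):
    """Importance-weighted DDPM noise-prediction loss, in the spirit of Eq. 9.

    eps_model : callable predicting the added noise from (x_k, k) (placeholder interface)
    x0        : clean data batch, shape (B, ...)
    w_T       : precomputed long-horizon importance weights w_T(x0), shape (B,)
    alpha_bar : cumulative noise schedule alpha_bar_k, shape (K,)
    """
    B, K = x0.shape[0], alpha_bar.shape[0]

    # Sample a diffusion step k uniformly and Gaussian noise eps for each example.
    k = torch.randint(0, K, (B,), device=x0.device)
    eps = torch.randn_like(x0)

    # Broadcast alpha_bar_k over the non-batch dimensions and form the noisy input
    # x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps.
    a_bar = alpha_bar[k].view((B,) + (1,) * (x0.dim() - 1))
    x_k = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # Standard per-example MSE between true and predicted noise ...
    per_example = ((eps - eps_model(x_k, k)) ** 2).flatten(start_dim=1).sum(dim=1)
    # ... rescaled by the importance weight, so high-weight examples dominate the gradient.
    return (w_T * per_example).mean()
```

Apart from the final multiplication by `w_T`, this is the standard DDPM training step of Ho et al. (2020), which is what makes LHTS a drop-in modification for existing diffusion training loops.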
We can apply LHTS in exactly the same way for other likelihood-based models by scaling the log-likelihood loss of each datapoint by its importance weight. For autoregressive models, however, we can take advantage of the autoregressive factorization to derive a variance-reduced formulation of LHTS, which we describe next.

### 4.2. Variance-Reduced LHTS on Autoregressive Models

To apply LHTS to autoregressive models, we rewrite the LHTS objective from Eq. 7 into a form that is amenable to autoregressive architectures by first sampling the index $i$ uniformly, then the prefix x