# Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh 1, Winnie Xu 2, Niklas Muennighoff 2, Dan Jurafsky 1, Douwe Kiela 1 2

Abstract

Kahneman & Tversky's prospect theory (1992) tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases; the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to their belonging to a family of loss functions that we call human-aware losses (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

1. Introduction

Aligning generative models with human feedback has been successfully used to make generations more helpful, factual, and ethical, among other desiderata (Ouyang et al., 2022; Tian et al., 2023). For LLMs, alignment methods such as RLHF and DPO have consistently proven to be more beneficial than doing supervised finetuning (SFT) alone. However, human feedback is often discussed only in the context of preferences (e.g., output yw ≻ yl for input x), even though it can take many forms (e.g., approval/disapproval of y given x). This is because preferences, despite being a kind of data that is relatively scarce and expensive to collect in practice (Casper et al., 2023), are required by the alignment methods shown to work best: RLHF (Christiano et al., 2017) and DPO (Rafailov et al., 2023).

1 Stanford University (first author was an intern at Contextual AI). 2 Contextual AI. Correspondence to: Kawin Ethayarajh. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Figure 1. The utility that a human gets from the outcome of a random variable, as implied by different human-aware losses (HALOs). Notice that the implied value functions share properties such as loss aversion with the canonical human value function in prospect theory (Tversky & Kahneman, 1992).

To understand why these methods work so well, and whether feedback needs to be in preference form, we frame alignment through the lens of prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). Prospect theory explains why humans make decisions about uncertain events that do not maximize their expected value. It formalizes how humans perceive random variables in a biased but well-defined manner; for example, relative to some reference point, humans are more sensitive to losses than gains, a property called loss aversion. We show that popular alignment methods such as DPO and PPO-Clip (Schulman et al., 2017) implicitly model some of these biases, helping explain their success independently of the data used (§3.2).
We then propose a more general class of such loss functions called human-aware losses (HALOs).1 1We use the term human-aware to draw an analogy with how hardware-aware methods benefit from being designed around hardware limitations (Dao et al., 2022), not to claim that a simple loss function is fully aware of human behavior. Model Alignment as Prospect Theoretic Optimization Although it is impossible to say that HALOs are categorically better than non-HALOs, we find that among existing methods, those that meet the definition of a HALO work better than those that do not ( 3.3). We find that DPO performance can even be matched at most scales by running an offline PPO variant on dummy +1/-1 rewards, suggesting that preference data might not be needed if the inductive bias in the loss function is good enough. However, despite the surprising success of this simple baseline, it significantly lags behind DPO at the 30B LLM scale and suffers from hyperparameter sensitivity, making it difficult to use. Taking a more principled approach, we derive a HALO using the model of human utility that Kahneman & Tversky proposed to describe how humans make decisions about uncertain monetary outcomes (Tversky & Kahneman, 1992). This approach, which we call Kahneman-Tversky Optimization (KTO), directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as most current methods do ( 4.1). KTO only requires a binary signal of whether an output is desirable or undesirable for an input. This data is more abundant, cheaper, and faster to collect in the real world, making it easier to scale alignment in production and rapidly iterate on models. We find that: KTO matches or exceeds DPO performance at scales from 1B to 30B parameters ( 4.2).2 That is, breaking up n preferences meant for DPO into 2n examples for KTO can yield better LLM generations, as determined by closed-ended tasks such as mathematical reasoning and open-ended judgments from humans and GPT-4. KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success cannot be ascribed to the alignment data being sourced from a preference dataset. When the pretrained model is sufficiently good, one can skip supervised finetuning (SFT) and go straight to KTO without a loss in generation quality, whereas SFT is always needed for best results with DPO. The intent behind KTO was that even if the model learns from a weaker signal, we could compensate with the higher volume of data that could be accessed in practice; the fact that KTO can match and even outperform DPO on the same data is thus surprising. We conclude by discussing some theoretical explanations for this phenomenon ( 4.3). Despite the success of KTO in our experiments, our work ultimately suggests that there is no one HALO that is universally superior; the best HALO depends on the inductive biases appropriate for a given setting, and this choice should be made deliberately instead of defaulting to any one loss. 2Our code is available on Github; models are on Huggingface. 2. Background For a detailed survey, please see Appendix A. In brief, LLMs are traditionally trained in three stages (Ouyang et al., 2022): Pretraining Given a large corpus, train the model to maximize the log-likelihood of the next token conditioned on the preceding text. Let π0 denote the pretrained model. 
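To make the pretraining objective concrete, here is a minimal sketch of next-token cross-entropy in PyTorch; the function name, tensor shapes, and batching are assumptions of mine rather than details from the paper's codebase. SFT (described next) uses the same loss on task-relevant data.

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Average negative log-likelihood of each token given the preceding text.

    logits: [batch, seq_len, vocab] scores from a causal LM pi(y_t | y_<t).
    tokens: [batch, seq_len] integer token ids for the same sequences.
    """
    # Predict token t+1 from positions <= t: shift logits left, targets right.
    shift_logits = logits[:, :-1, :]      # [batch, seq_len - 1, vocab]
    shift_targets = tokens[:, 1:]         # [batch, seq_len - 1]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )

# Minimizing this loss maximizes the log-likelihood of the next token
# conditioned on the preceding text, i.e., the pretraining objective above.
```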
Supervised Finetuning (SFT) Finetune the model to predict the next token on data that is more relevant to the downstream task. Often, such data will comprise instructions and an appropriate response (i.e., instruction finetuning). Let πref denote the finetuned model.

RLHF Given a dataset D of preferences (x, yw, yl), where x is an input, yw and yl are the preferred and dispreferred outputs (i.e., yw ≻ yl for x), and r* is the true reward function underlying the preferences, it is first assumed that the probability that yw is preferred to yl can be captured with a specific function class, typically a Bradley-Terry model (Bradley & Terry, 1952). Where σ is the logistic function:

p*(yw ≻ yl | x) = σ(r*(x, yw) − r*(x, yl))   (1)

Since getting the true reward from a human would be intractably expensive, a reward model rϕ learns to serve as a proxy, done by minimizing the negative log-likelihood of the human preference data:

LR(rϕ) = E_{x,yw,yl ∼ D}[−log σ(rϕ(x, yw) − rϕ(x, yl))]

But solely maximizing the reward might come at the expense of desiderata such as generating grammatical text. To avoid this, a KL divergence penalty is introduced to restrict how far the language model can drift from πref. Where πθ is the model we are optimizing, the optimal model π* is that which maximizes

E_{x ∼ D, y ∼ πθ}[rϕ(x, y)] − β DKL(πθ(y|x) ‖ πref(y|x))   (2)

where β > 0 is a hyperparameter. Since this objective is not differentiable, we need to use an RL algorithm like PPO (Schulman et al., 2017).

However, RLHF is often slow (largely because of having to sample generations) and quite unstable in practice (especially in a distributed setting). For this reason, recent work has focused on designing closed-form losses that maximize the margin between the preferred and dispreferred generations. In particular, Direct Preference Optimization (DPO) (Rafailov et al., 2023) has emerged as a popular alternative, since it allows the same optimal policy as in RLHF to be recovered under certain conditions:

LDPO(πθ, πref) = E_{x,yw,yl ∼ D}[−log σ(β log (πθ(yw|x) / πref(yw|x)) − β log (πθ(yl|x) / πref(yl|x)))]   (3)
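To make eq. (3) concrete, here is a minimal sketch of the DPO loss computed from sequence log-probabilities; the function name and the assumption that per-sequence log-probabilities have already been summed over tokens are mine, not from the paper's or DPO's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_w: torch.Tensor,   # log pi_theta(y_w | x), summed over tokens, shape [batch]
    policy_logp_l: torch.Tensor,   # log pi_theta(y_l | x)
    ref_logp_w: torch.Tensor,      # log pi_ref(y_w | x)
    ref_logp_l: torch.Tensor,      # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implied rewards r_theta = beta * log(pi_theta / pi_ref) for each output.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Eq. (3): negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(reward_w - reward_l).mean()
```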
3. A Prospect Theoretic View of Alignment

To understand why alignment methods work so well, we now frame them through the lens of prospect theory (Tversky & Kahneman, 1992). Prospect theory explains why, when faced with an uncertain event, humans make decisions that do not maximize their expected value. For example, because humans are loss-averse, given a gamble that returns $100 with 80% probability and $0 with 20% probability, a person might accept $60 to avoid the gamble, despite their certainty equivalent of $60 being less than the expected value of $80.

3.1. Prospect Theory

In prospect theory, human utility depends on a value function and a weighting function:3

Definition 3.1. A value function v : Z → R maps an outcome z, relative to some reference point z0, to its perceived (or subjective) value. For example, these functions capture the fact that humans tend to be more sensitive to relative losses than relative gains of the same magnitude.

Definition 3.2. A weighting function ω is the derivative of a capacity function that maps cumulative probabilities to perceived cumulative probabilities. These functions capture, for example, the fact that humans tend to overestimate the chance of rare events. Let ωz denote the weight placed on outcome z.

Definition 3.3. The utility of a random variable Z is a function of its outcomes: u(Z) = Σ_{z ∈ Z} ωz v(z − z0).

However, because humans do not see the full probability distribution of an LLM, weighting functions are not salient to this discussion; we will focus only on value functions. Using experiments that presented real humans with monetary gambles and asked for their certainty equivalent, Tversky & Kahneman (1992) proposed the following functional form for human value:

v(z; λ, α, z0) =  (z − z0)^α      if z ≥ z0
                 −λ(z0 − z)^α     if z < z0        (4)

where α = 0.88 and λ = 2.25 are the median values of the hyperparameters across individuals. α controls the curvature of the function, which reflects risk aversion; λ controls its steepness, which reflects loss aversion. While the shape of the median Kahneman-Tversky value function is illustrated in Figure 1, it should be noted that it varies across individuals (Tversky & Kahneman, 1992). There are also other functional forms for the value function that have been proposed in later work (Gurevich et al., 2009). The salient qualities of a value function are: the existence of a reference point that is used to get the relative gain or loss; concavity in relative gains (i.e., diminishing sensitivity away from z0); and loss aversion (i.e., greater sensitivity to losses).

3 Cumulative prospect theory is the full name of the expanded theory we discuss here (Tversky & Kahneman, 1992).

3.2. HALOs

Informally, HALOs are loss functions that reflect the model of human decision-making in Tversky & Kahneman (1992). Formally:

Definition 3.4 (HALOs). Let θ denote the trainable parameters of the model πθ : X → P(Y) being aligned, πref the reference model, l : Y → R+ a normalizing factor, and rθ(x, y) = l(y) log[πθ(y|x) / πref(y|x)] the implied reward. Then f : X × Y → R is a human-aware loss if there exists a reference point distribution Q(X′, Y′ | x, y) over X × Y and a value function vf : R → R that is non-decreasing everywhere and concave in (0, ∞) such that f is linear in

vf(rθ(x, y) − E_Q[rθ(x′, y′)])   (5)

In a classic prospect theory experiment, rθ would be the dollar amount assigned to each outcome; here, rθ is measured in nats, as the decrease in conditional surprisal when going from πref to πθ, normalized according to l. This follows naturally from the next-token prediction objective used to pretrain and finetune LLMs. As πθ is aligned, we would expect rθ to grow increasingly positive for desirable outputs and increasingly negative for undesirable outputs.

Another perspective on the reward comes from the RLHF objective in (2). The policy that maximizes this objective has a closed-form expression (Peng et al., 2019; Peters & Schaal, 2007); where Z(x) is the partition function:

π*(y|x) = (1/Z(x)) πref(y|x) exp((1/β) r*(x, y))

Letting l(·) = β, we get

rθ*(x, y) = r*(x, y) − β log Z(x)   (6)

Under θ*, the HALO-defined reward is just the optimal reward shifted by an input-specific term, meaning that rθ* is in the same equivalence class as r* and would also induce the optimal policy π* (Lemma 1, Rafailov et al. (2023)).

The reference point in a HALO is the expected reward from the human's perspective, where Q(X′, Y′ | x, y) describes the examples that are perceived as a baseline. We require that the value function be concave in gains but not necessarily convex in losses, unlike the canonical Kahneman-Tversky value function, because a minority of individuals are risk-averse in both the gain and loss regime, resulting in concavity everywhere (Kahneman & Tversky, 1979).

Theorem 3.5. DPO and PPO-Clip are human-aware losses.
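As a concrete illustration of the value functions being compared, here is a small sketch of the canonical Kahneman-Tversky form in eq. (4) and of DPO's implied value function log σ(·) from the construction behind Theorem 3.5. The parameter defaults follow the median values reported above; the function names and the stand-alone script framing are mine.

```python
import math

def kahneman_tversky_value(z: float, z0: float = 0.0,
                           alpha: float = 0.88, lam: float = 2.25) -> float:
    """Canonical value function from eq. (4): concave in gains, steeper in losses."""
    if z >= z0:
        return (z - z0) ** alpha
    return -lam * (z0 - z) ** alpha

def dpo_implied_value(z: float) -> float:
    """DPO's implied value function log(sigmoid(z)), where z is a reward margin."""
    # Numerically stable log-sigmoid.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

# Both functions are non-decreasing and concave for gains (z > z0), and both
# move further for a loss of a given size than for an equal-sized gain relative
# to the reference point, i.e., they exhibit loss aversion (cf. Figure 1).
```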
Figure 2. HALOs (DPO, offline PPO variant) outperform non-HALOs (SLiC, CSFT), as measured by the GPT-4-0613-judged winrate of the aligned model's generations against the outputs that would have been used for SFT. For better visualization, the y-axis here plots the winrate above chance (i.e., the winrate − 50%). The difference between methods is only significant (p < 0.05) at 13B+ parameters, and only the HALO-aligned Llama-{13B, 30B} models are able to match the baseline and yield a winrate at or above chance.

The proof is deferred to Appendix B. In Figure 1, we can see this more intuitively by plotting the value function for each loss (i.e., the implied human utility).

3.3. Does being a HALO matter?

A natural question is whether modeling human biases has practical benefits. This is difficult to answer, since both HALOs and non-HALOs are diverse function classes, but we attempt to do so by comparing popular methods that qualify as a HALO with those that do not:

1. CSFT: Conditional SFT is a simple alignment method where a control token is prepended to the output during training; then, at inference, the control token corresponding to desirable generations (e.g., <|good|>) is appended to the input to induce good generations (Korbak et al., 2023). This is not a HALO.

2. SLiC: Sequence Likelihood Calibration (Zhao et al., 2023), which combines a max-margin loss for preferences with a language modeling loss, is not a HALO:

LSLiC(πθ, πref) = Lcal(πθ) + λreg Lreg(πθ)
Lcal = E_{x,yw,yl ∼ D}[max(0, δ − log πθ(yw|x) + log πθ(yl|x))]
Lreg = E_{x ∼ D, y ∼ πref(x)}[−log πθ(y|x)]

3. DPO: DPO is a HALO (Theorem 3.5).

4. PPO (offline): The standard RLHF objective in (2) is typically optimized with PPO-Clip, which works by clipping how far πθ can drift from the version πold at the previous step:

LPPO (offline) = −E_{x,y,t ∼ D}[min(qθ At, clip(qθ, 1 − ε, 1 + ε) At)]

where qθ = πθ(yt | x, y<t) / πold(yt | x, y<t) is the probability ratio and At is the advantage of token yt.

In contrast to DPO, which under contradictory preferences can be more likely to produce the minority-preferred output, the optimal KTO policy will deterministically produce the majority-preferred output if the value function is loss-neutral (λD = λU), which is the default setting (Theorem 4.3). This suggests that KTO has better worst-case outcomes when handling feedback intransitivity.

4.4. KTO vs. DPO: when to use which?

When human feedback is in a binary format, and especially when there is an imbalance between the number of desirable and undesirable examples, KTO is the natural choice. When your data is in the form of preferences, the choice is less clear. Putting aside the greater data efficiency of KTO, our theoretical analysis suggests that if your preference data has sufficiently little noise and sufficiently little intransitivity, then DPO will work better, since there is some risk of KTO underfitting; this risk can be mitigated by using a lower value of β and aligning with KTO for more epochs. But if there is enough noise and intransitivity, then the better worst-case guarantees of KTO will win out. Most publicly available preference datasets (e.g., SHP, OpenAssistant) contain noisy feedback from many different humans whose preferences likely contradict to some extent, which explains why KTO was able to match or exceed DPO performance in our experiments. Even synthetic feedback can be noisy and intransitive, which helps explain why KTO outperforms DPO when aligning with UltraFeedback.

5. Future Work

The existence of HALOs raises many questions. For one, KTO is based on the Kahneman-Tversky value function for monetary gambles, which is almost certainly different from how humans perceive the relative goodness of text.
What value functions and reference point distributions best describe how humans perceive language, and how do they vary across domains and individuals? How can we identify the best HALO for each individual and setting instead of using one default loss? On a more technical level, important directions include developing HALOs that: (1) incorporate granular feedback, such as a score, especially when optimizing for multiple desiderata; (2) work for other modalities (e.g. images) and model classes (e.g., diffusion models), especially models that do not produce an explicit distribution over the output space; (3) can resolve contradictions in feedback according to different definitions of fairness; (4) are designed to be used with online data, where the direction of feedback is implied by rθ or some external reward data. Ecologically valid evaluation (De Vries et al., 2020), where the aligned models are deployed in real-world settings, are also needed to judge the merits of different HALOs. 6. Conclusion Although model alignment has historically been rewardcentric, we found that the inductive biases of alignment objectives are critical to their success. Moreover, these inductive biases have analogs in the prospect theory literature, suggesting that they work in part because they reflect human biases in decision-making. We abstracted these insights into a family of alignment objectives called human-aware losses (HALOs). We then proposed a HALO called Kahneman Tversky Optimization (KTO) for directly maximizing the utility of generations instead of maximizing the likelihood of preferences, as existing methods do. Despite only learning from a binary signal of whether an output is (un)desirable, KTO was as good or better than preference-based methods in our experiments. More broadly, our work suggests that akin to how there is no one reward model that is universally superior, there is no one loss function either the best HALO depends on the inductive biases that are most appropriate for a given setting, and much work remains to be done in identifying the best HALO for each context. Model Alignment as Prospect Theoretic Optimization Acknowledgements We thank Percy Liang, Dilip Arumugam, Arya Mc Carthy, and Nathan Lambert for feedback. We thank Stas Bekman and Gautam Mittal for cluster assistance and Alex Manthey for helping with human evaluation. Impact Statement The methods discussed in this paper have the potential to make LLMs more helpful and safer, which is often needed for models deployed in production. It is possible that in making models more helpful, we increase the utility of one person at the expense of broader society. In aligning models with human feedback, one may also without even fully recognizing it be aligning to an unrepresentative subset of the population, which may hinder the ability of individuals outside that subset to benefit equally from using the model. The data used for LLM alignment, including the datasets used in this paper (e.g., SHP, HH, OASST) contain preferences of groups that are not representative of the broader population. Biases in this data have the potential to be propagated downstream when used to align models with methods like KTO, especially when no efforts are made to adjust for the different population. KTO in particular implicitly resolves contradictions in feedback by taking the majoritypreferred outcome for a loss-neutral value function, which does not comport with many theories of fairness (e.g., Rawlsianism). 
Since user preferences are, in turn, affected by the models they interact with, this also risks the homogenization of preferences and utility functions when KTO-aligned models are deployed at scale. The design of HALOs that resolve contradictions in more diverse ways is an important direction for future work. On the other hand, because KTO works with binary feedback, which is more abundant, cheaper, and faster to collect in the real world, it significantly lowers the barrier to data collection. This makes it easier to collect feedback from traditionally under-represented groups and serve different models to different users, instead of just one monolithic model being served to everyone. Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447 4455. PMLR, 2024. Baheti, A., Lu, X., Brahman, F., Le Bras, R., Sap, M., and Riedl, M. Improving language models with advantagebased offline policy gradients. In The Twelfth International Conference on Learning Representations, 2023. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862, 2022. Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397 2430. PMLR, 2023. Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952. Busa-Fekete, R., Sz or enyi, B., Weng, P., Cheng, W., and H ullermeier, E. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine learning, 97:327 351, 2014. Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2023. Chan, L., Critch, A., and Dragan, A. Human irrationality: both bad and good for reward inference. ar Xiv preprint ar Xiv:2111.06956, 2021. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. ar Xiv preprint ar Xiv:2107.03374, 2021. Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. ar Xiv preprint ar Xiv:2401.01335, 2024. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. ar Xiv preprint ar Xiv:2110.14168, 2021. Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. Ultrafeedback: Boosting language models with high-quality feedback, 2023. Dao, T., Fu, D., Ermon, S., Rudra, A., and R e, C. 
Flashattention: Fast and memory-efficient exact attention with Model Alignment as Prospect Theoretic Optimization io-awareness. Advances in Neural Information Processing Systems, 35:16344 16359, 2022. De Vries, H., Bahdanau, D., and Manning, C. Towards ecologically valid research on language user interfaces. ar Xiv preprint ar Xiv:2007.14435, 2020. Ethayarajh, K., Choi, Y., and Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988 6008. PMLR, 17 23 Jul 2022. Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. ar Xiv preprint ar Xiv:2209.07858, 2022. Gurevich, G., Kliger, D., and Levy, O. Decision-making under uncertainty a field study of cumulative prospect theory. Journal of Banking & Finance, 33(7):1221 1229, 2009. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173 182, 2017. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. Hoeffler, S. and Ariely, D. Constructing stable preferences: A look into dimensions of experience and their impact on preference stability. Journal of consumer psychology, 8 (2):113 139, 1999. Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pp. 65 70, 1979. Hong, J., Lee, N., and Thorne, J. Reference-free monolithic preference optimization with odds ratio. ar Xiv preprint ar Xiv:2403.07691, 2024. Jain, A., Wojcik, B., Joachims, T., and Saxena, A. Learning trajectory preferences for manipulators via iterative improvement. Advances in neural information processing systems, 26, 2013. Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. ar Xiv preprint ar Xiv:2310.06825, 2023. Kahneman, D. and Tversky, A. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263 292, 1979. K opf, A., Kilcher, Y., von R utte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. Openassistant conversations democratizing large language model alignment. ar Xiv preprint ar Xiv:2304.07327, 2023. Korbak, T., Shi, K., Chen, A., Bhalerao, R. V., Buckley, C., Phang, J., Bowman, S. R., and Perez, E. Pretraining language models with human preferences. In International Conference on Machine Learning, pp. 17506 17533. PMLR, 2023. Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8): 30 37, 2009. Kreutzer, J., Uyheng, J., and Riezler, S. Reliability and learnability of human bandit feedback for sequence-tosequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1777 1788, 2018. Kwon, M., Biyik, E., Talati, A., Bhasin, K., Losey, D. P., and Sadigh, D. 
When humans aren t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE international conference on humanrobot interaction, pp. 43 52, 2020. Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/ alpaca_eval, 2023. Lu, X., Welleck, S., Hessel, J., Jiang, L., Qin, L., West, P., Ammanabrolu, P., and Choi, Y. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591 27609, 2022. Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., et al. Nash learning from human feedback. ar Xiv preprint ar Xiv:2312.00886, 2023. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744, 2022. Model Alignment as Prospect Theoretic Optimization Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. ar Xiv preprint ar Xiv:1910.00177, 2019. Peters, J. and Schaal, S. Reinforcement learning by rewardweighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pp. 745 750, 2007. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Thirtyseventh Conference on Neural Information Processing Systems, 2023. Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T. Direct nash optimization: Teaching language models to self-improve with general preferences. ar Xiv preprint ar Xiv:2404.03715, 2024. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ar Xiv preprint ar Xiv:2206.04615, 2022. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33: 3008 3021, 2020. Sun, L., Zhan, W., Hu, Y., and Tomizuka, M. Interpretable modelling of driving behaviors in interactive driving scenarios based on cumulative prospect theory. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 4329 4335. IEEE, 2019. Swamy, G., Dann, C., Kidambi, R., Wu, Z. S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2401.04056, 2024. Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. ar Xiv preprint ar Xiv:2311.08401, 2023. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023. 
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. Zephyr: Direct distillation of lm alignment, 2023. Tversky, A. and Kahneman, D. Availability: A heuristic for judging frequency and probability. Cognitive psychology, 5(2):207 232, 1973. Tversky, A. and Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and uncertainty, 5:297 323, 1992. von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., and Huang, S. Trl: Transformer reinforcement learning. https://github. com/huggingface/trl, 2020. Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., and Weston, J. Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2019. Xu, H., Sharaf, A., Chen, Y., Tan, W., Shen, L., Van Durme, B., Murray, K., and Kim, Y. J. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. ar Xiv preprint ar Xiv:2401.08417, 2024. Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. ar Xiv preprint ar Xiv:2401.10020, 2024. Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. Slic-hf: Sequence likelihood calibration with human feedback. ar Xiv preprint ar Xiv:2305.10425, 2023. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. ar Xiv preprint ar Xiv:2306.05685, 2023. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593, 2019. Model Alignment as Prospect Theoretic Optimization A. Related Work LLM Alignment Human feedback has been used to improve LLM capabilities in translation (Kreutzer et al., 2018), summarization (Stiennon et al., 2020), sentiment-conditioned generation (Ziegler et al., 2019), and instruction-following (Ouyang et al., 2022). The RLHF framework (Christiano et al., 2017; Bai et al., 2022) traditionally used to accomplish this is detailed in 2. Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023). This single stage of optimization distinguishes DPO from the conventional approach in preference-based RL, which learns a reward and then fits the policy to those rewards (Jain et al., 2013; Busa-Fekete et al., 2014). Other preference-based losses include CPO (Xu et al., 2024) and IPO (Azar et al., 2024). Binary Feedback Despite not being a human-aware loss, unlikelihood training was among the first methods to align language models using a binary signal (Welleck et al., 2019). However, Korbak et al. (2023) found unlikelihood training to be worse than the CSFT baseline we tested in this work, which is among various approaches that convert a binary/discrete signal into a control token (Lu et al., 2022). Learning from sparse binary feedback is a staple of information retrieval and recommender systems (He et al., 2017; Koren et al., 2009). Many retrieval-augmented generation systems use contrastive learning to ensure that generations are grounded. 
This can be framed as learning from synthetic binary feedback, although depending on the implementation, it may be the retriever and not the LLM that is updated (Lu et al., 2022).

Online Alignment A recent string of work has centered on the idea of self-training or self-play, during which the policy is continually aligned on online data sampled from itself and then filtered (Chen et al., 2024; Yuan et al., 2024). Many of these approaches frame the learning of a preference model as a two-player min-max game between two policies (Munos et al., 2023; Rosset et al., 2024; Swamy et al., 2024). In theory, KTO can also be adapted for online alignment, though we leave this as a direction for future work.

Prospect Theory Prospect theory, despite being influential in behavioral economics, has had a muted impact in machine learning, with work concentrated in human-robot interaction (Kwon et al., 2020; Sun et al., 2019; Chan et al., 2021).

B. Proofs

Theorem 3.5 (restated) DPO and PPO-Clip are human-aware loss functions.

Proof. For a loss f to be a HALO, it needs to be linear in vf(rθ(x, y) − E_Q[rθ(x′, y′)]), where rθ(x, y) = l(y) log (πθ(y|x) / πref(y|x)) is the implied reward (normalized by the factor l(y)), Q(X′, Y′ | x, y) is an input-conditioned reference point distribution, and vf : R → R is a value function (in the prospect-theoretic sense) that is non-decreasing everywhere and concave in (0, ∞).

The DPO loss is

LDPO(πθ, πref) = E_{x,yw,yl ∼ D}[−log σ(β log (πθ(yw|x) / πref(yw|x)) − β log (πθ(yl|x) / πref(yl|x)))]

where β > 0 is a hyperparameter. DPO meets the criteria with the following construction: l(y) = β; rθ = β log (πθ(y|x) / πref(y|x)); vDPO(·) = log σ(·), which is increasing and concave everywhere; and Q places all mass on (x, yl), where yl is a dispreferred output for x such that y ≻ yl.

The PPO-Clip loss is

LPPO (offline) = −E_{x,y,t ∼ D}[min(qθ At, clip(qθ, 1 − ε, 1 + ε) At)]

where qθ = πθ(yt | x, y<t) / πold(yt | x, y<t) is the probability ratio and At is the advantage of token yt.
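For reference, a minimal sketch of the clipped surrogate above, computed from per-token log-probabilities. The asymmetric clipping interval mirrors the offline variant described in Appendix C; the function name, tensor shapes, and the suggestion of a broadcast dummy reward as the advantage are assumptions of mine rather than the paper's implementation.

```python
import torch

def ppo_clip_loss(
    policy_logp: torch.Tensor,   # log pi_theta(y_t | x, y_<t), shape [batch, seq_len]
    old_logp: torch.Tensor,      # log pi_old(y_t | x, y_<t) for the same tokens
    advantages: torch.Tensor,    # per-token advantage estimates, shape [batch, seq_len]
    clip_low: float = 0.25,      # asymmetric interval from Appendix C (vs. the usual 1 - eps)
    clip_high: float = 4.0,
) -> torch.Tensor:
    ratio = torch.exp(policy_logp - old_logp)                  # q_theta
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, clip_low, clip_high) * advantages
    # Negate because PPO maximizes the clipped surrogate; here we minimize a loss.
    return -torch.min(unclipped, clipped).mean()

# In the offline variant, pi_old is never updated (one step in the trust region);
# using a broadcast dummy +1/-1 reward per sequence as the advantage is an
# assumption here, since the paper's implementation learns a separate value head.
```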
The gradient of the KTO loss is simple to interpret: if y is desirable, then the corresponding term d(y) is negative and we push up the probability of πθ(y|x) to minimize the loss; if y is undesirable, then d(y) is positive and we push down the probability of πθ(y|x) to minimize the loss. As rθ tends to ±∞, the gradient will tend to zero, since either (1 − σ(βz)) or σ(βz) will tend to zero.

Theorem 4.2 (restated) Assuming the value function is logistic, for a reward function ra* that maximizes (2), there exists a reward function in its equivalence class (i.e., rb*(x, y) = ra*(x, y) + h(x) for some h(x)) that induces the same optimal policy π* and the same Bradley-Terry preference distribution but a different human value distribution.

Proof. Following the definition in Rafailov et al. (2023), we say ra* and rb* are in the same equivalence class if there exists some function h(x) such that rb*(x, y) = ra*(x, y) + h(x). From Lemma 1 in Rafailov et al. (2023), we know that two functions in the same equivalence class induce the same optimal policy:

π*_{ra}(y|x) = (1/Z(x)) πref(y|x) exp((1/β) ra*(x, y))
            = [πref(y|x) exp((1/β) ra*(x, y))] / [Σ_y πref(y|x) exp((1/β) ra*(x, y))]
            = [πref(y|x) exp((1/β) ra*(x, y)) exp((1/β) h(x))] / [Σ_y πref(y|x) exp((1/β) ra*(x, y)) exp((1/β) h(x))]
            = [πref(y|x) exp((1/β)(ra*(x, y) + h(x)))] / [Σ_y πref(y|x) exp((1/β)(ra*(x, y) + h(x)))]
            = π*_{rb}(y|x)

For a Bradley-Terry model of preferences, it is trivial to show that p(yw ≻ yl | x) is unaffected by h(x), since it is added to the reward of both yw and yl. We will now show that the two reward functions do not necessarily induce the same distribution of human values. A Taylor series expansion of the human value of ra*(x, y) would be:

σ(0) + σ′(0)(ra*(x, y) − E_Q[ra*(x, y)]) + (σ″(0)/2)(ra*(x, y) − E_Q[ra*(x, y)])² + ...

Choosing h(x) such that E_Q[h(x)] = 0, a Taylor series expansion of the value of ra*(x, y) + h(x) would be:

σ(h(x)) + σ′(h(x))(ra*(x, y) − E_Q[ra*(x, y)]) + (σ″(h(x))/2)(ra*(x, y) − E_Q[ra*(x, y)])² + ...

Since σ is strictly monotonic, for these series to be equal, we must have h(x) = 0. If this is not the case, then the values of ra*(x, y) and rb*(x, y) will be different. Thus two arbitrary reward functions in the same equivalence class do not induce the same distribution of human values.

Theorem 4.3 (restated) For input x with outputs {ya, yb}, let dataset D comprise contradictory preferences ya ≻ yb and yb ≻ ya in proportion p ∈ (0.5, 1) and (1 − p) ∈ (0, 0.5) respectively. If p^{1/β} πref(ya|x) < (1 − p)^{1/β} πref(yb|x), then the optimal DPO policy is more likely to produce the minority-preferred yb; the optimal KTO policy will strictly produce the majority-preferred ya for a loss-neutral value function (λD = λU).

Proof. Where u = β(rθ(x, ya) − rθ(x, yb)), we can write the total DPO loss for x as

LDPO(x) = p(−log σ(u)) + (1 − p)(−log σ(−u))

Taking the derivative with respect to u and setting it to zero, we get

0 = −p σ(u)σ(−u)/σ(u) + (1 − p) σ(−u)σ(u)/σ(−u)
  = −p(1 − σ(u)) + (1 − p)σ(u)
  = −p + σ(u)
⟹ u = σ⁻¹(p)
⟹ β rθ*(x, ya) = σ⁻¹(p) + β rθ*(x, yb)
⟹ β log (πθ*(ya|x) / πref(ya|x)) = log (p / (1 − p)) + β log (πθ*(yb|x) / πref(yb|x))
⟹ πθ*(ya|x) = (p / (1 − p))^{1/β} (πref(ya|x) / πref(yb|x)) πθ*(yb|x)

Thus when p^{1/β} πref(ya|x) < (1 − p)^{1/β} πref(yb|x), we have πθ*(ya|x) < πθ*(yb|x), meaning the optimal DPO policy is more likely to produce the minority-preferred yb.

Where ua = β(rθ(x, ya) − E_Q[rθ(x′, y′)]) and ub = β(rθ(x, yb) − E_Q[rθ(x′, y′)]), noting that 1 − σ(−u) = σ(u), we can write the total KTO loss for x as

LKTO(x) = pλD(1 − σ(ua)) + (1 − p)λU σ(ua) + pλU σ(ub) + (1 − p)λD(1 − σ(ub))
        = pλD + ((1 − p)λU − pλD)σ(ua) + (1 − p)λD + (pλU − (1 − p)λD)σ(ub)
        = λD + ((1 − p)λU − pλD)σ(ua) + (pλU − (1 − p)λD)σ(ub)
        = λD + λD((1 − 2p)σ(ua) + (2p − 1)σ(ub))   (under loss neutrality)

Given that p > 0.5 by assumption and λD > 0 by definition, the KTO loss is decreasing in ua and increasing in ub, and thus decreasing in rθ(x, ya) and increasing in rθ(x, yb) respectively. The optimal KTO policy is thus πθ*(y|x) = 1[y = ya].

C. Implementations

SLiC Instead of sampling from the reference model to calculate Lreg, as Zhao et al. (2023) do (it is very slow), we just apply the cross-entropy loss to the SFT data, assuming that the reference model recovers the SFT distribution.

DPO We use the implementation of DPO in the code provided by Rafailov et al. (2023). We found that, as mentioned in the original paper, β = 0.1 works best for most settings. Other training configurations, such as the learning rate and optimizer, were borrowed from the original paper.

CSFT The control tokens used for generating the good and bad outputs are <|good|> and <|bad|> respectively, following the precedent set in Korbak et al. (2023).

KTO We use β = 0.1 in our experiments unless otherwise specified (the same setting as for DPO), as it is close to optimal for most settings. By default, λD = λU = 1. In experiments on imbalanced data subsampled from [SHP, HH, OASST], we found that setting λU, λD such that the effective ratio of desirable:undesirable examples was 4:3 worked best, regardless of which group was in the minority (see (8)). However, in running data-imbalanced experiments on UltraFeedback, we found that an effective ratio of 1:1 worked best. The other hyperparameters (e.g., learning rate) are the same as in DPO.
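The following is a minimal sketch of the per-example KTO loss implied by the terms used in the proof of Theorem 4.3, namely λD(1 − σ(β(rθ − z0))) for desirable outputs and λU σ(β(rθ − z0)) for undesirable ones. The function name, tensor shapes, and the detached batch-mean estimate of the reference point z0 are assumptions of mine, not the paper's exact estimator.

```python
import torch

def kto_loss(
    policy_logp: torch.Tensor,   # log pi_theta(y | x), summed over tokens, shape [batch]
    ref_logp: torch.Tensor,      # log pi_ref(y | x), shape [batch]
    desirable: torch.Tensor,     # boolean mask: True if y is a desirable output for x
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> torch.Tensor:
    reward = policy_logp - ref_logp                 # r_theta(x, y) in nats
    # Reference point z0 = E_Q[r_theta(x', y')]; a detached batch mean is used
    # here as a stand-in, which is an assumption rather than the paper's estimator.
    z0 = reward.mean().detach()
    margin = beta * (reward - z0)
    loss_desirable = lambda_d * (1.0 - torch.sigmoid(margin))
    loss_undesirable = lambda_u * torch.sigmoid(margin)
    return torch.where(desirable, loss_desirable, loss_undesirable).mean()
```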
PPO PPO-Clip is the traditional means of optimizing the RLHF objective (2). However, most implementations of PPO-Clip for LLM alignment suffer from instability, particularly during distributed training. We find that running the PPO-Clip objective on offline data with the following tricks leads to much more stable training: We never update the reference distribution (i.e., the policy only takes one step in the trust region). Baheti et al. (2023) recommend this as well. To accommodate for this conservative change, we clip the probability ratios more liberally, finding that an asymmetric interval of [0.25, 4.0] works best instead of the small symmetrical interval (e.g., [0.8, 1.2]) that is traditionally recommended. Including a KL penalty (between the policy and reference distributions) in addition to the clipping makes training more stable, as is also done in the implementation by von Werra et al. (2020). We find that it is important to estimate the KL term not using the entire distribution but rather as the mean difference in the predicted log probabilities of the actual output tokens (i.e., the labels). We suspect that this makes a difference because the rest of the distribution can be poorly calibrated. The value of a state is generally predicted by some value head attached to the policy model; the value loss is the MSE between the predicted value and the discounted sum of future rewards for each token. This is a linear layer in many RLHF implementations (von Werra et al., 2020). However, we find that backpropagating the value loss through this head and the policy leads to worse performance. Instead, we make the value head a 3-layer MLP and detach it from the computational graph, so that the value losses are not backpropagated through the policy model but the value head still has sufficient capacity to learn good estimates. D. Human Evaluation For human evaluation, we randomly sampled 256 prompts from the Open Assistant test set and generated outputs from Mistral 7B models aligned with DPO and KTO. All inputs were multi-turn conversations between a user and an assistant, where the LLM played the role of the assistant (see Table 5 for an example) and the last turn in the input was that of the user. These were sent to a third-party data annotation service where a pool of workers picked either the generated output or the SFT target (from the Open Assistant dataset) as the more appropriate response by the assistant. Any questions that required specific domain experience (e.g., coding) were skipped, leading to 214 comparisons for DPO and KTO each. The winrates of the aligned model over the SFT targets are 72.9% 5.3 for KTO and 62.1% 5.7 for DPO (where the intervals are 90% binomial confidence intervals). In contrast, Table 2 contains the winrates when the same experiment is run with GPT-4 as a judge instead: 65.2% 3.6 for KTO and 60.0% 3.7 for DPO. Thus although there is no significant difference in the GPT-4-based evaluation, there is a significant difference with human evaluation at p < 0.05. We found that 68.7% of the individual human judgments concurred with the GPT-4 judgments for KTO; this number fell to 65.9% for DPO. E. Additional Experiments Model Alignment as Prospect Theoretic Optimization Table 3. KTO alignment with only desirable/positive examples (i.e., the same data that would be used for SFT), yields similar results to SFT when β is small; this is intuitive because it is less bound to the reference model. λD is fixed at 1. 
| Dataset | MMLU | GSM8k | HumanEval | BBH | TydiQA | AlpacaEval 2 | Avg. |
|---|---|---|---|---|---|---|---|
| Setup | 0-shot | 8-shot, CoT | 0-shot, Synth. | 3-shot, CoT | 1-shot, GP | 0-shot | |
| Metric | EM | EM | pass@1 | EM | F1 | % Win | |
| SFT | 57.2 | 1.0 | 54.4 | 33.2 | 25.4 | 7.17 | 29.7 |
| KTO (β = 0.1) | 57.8 | 12.5 | 53.2 | 29.5 | 28.0 | 5.46 | 31.1 |
| KTO (β = 0.01) | 57.3 | 1.5 | 53.7 | 33.4 | 26.5 | 5.31 | 29.6 |

Figure 6. AlpacaEval 2 winrates of Mistral 7B aligned with various methods on the OpenAssistant dataset. GPT-4-turbo with chain-of-thought is used as the LM judge. KTO outperforms the SFT- and DPO-aligned models at all temperatures.

Table 4. Results from aligning Zephyr-β-SFT (Tunstall et al., 2023) on UltraFeedback. Note that the base model had not been finetuned on UltraFeedback but on a similar dataset called UltraChat; the first row here denotes the finetuning results. All the methods below see exactly 1 epoch of UltraFeedback, for an apples-to-apples comparison. AlpacaEval 2 was run using GPT-4-turbo with CoT as the LM judge. All evaluations were run with sampling temperature 0.7. Among the KTO variants: "one-y-per-x" denotes training standard KTO with either the positive or negative output (not both); "no z0" refers to a constant reference point of 0; "concave" refers to a variant where the human value function is concave in the loss regime (i.e., minimize −log σ(·) instead of 1 − σ(·)); "no πref" refers to a reference-model-free variant; "risk-neutral" refers to a variant where the human value function is just the identity function, implying risk neutrality.

| Dataset | MMLU | GSM8k | HumanEval | BBH | TydiQA | AlpacaEval 2 | Avg. |
|---|---|---|---|---|---|---|---|
| Setup | 0-shot | 8-shot, CoT | 0-shot, Synth. | 3-shot, CoT | 1-shot, GP | 0-shot | |
| Metric | EM | EM | pass@1 | EM | F1 | % Win | |
| SFT | 57.2 | 39.0 | 30.1 | 46.3 | 36.3 | 6.4 | 35.9 |
| DPO | 58.2 | 40.0 | 30.1 | 44.1 | 36.5 | 7.8 | 36.1 |
| ORPO (λ = 0.1) | 57.1 | 36.5 | 29.5 | 47.5 | 38.5 | 5.0 | 35.7 |
| KTO | 58.6 | 53.5 | 30.9 | 52.6 | 31.2 | 12.5 | 39.9 |
| KTO (one-y-per-x) | 58.0 | 50.0 | 30.7 | 49.9 | 37.7 | 11.6 | 39.7 |
| KTO (no z0) | 58.5 | 49.5 | 30.7 | 49.0 | 29.0 | 12.1 | 38.1 |
| KTO (concave) | 58.3 | 42.5 | 30.6 | 43.2 | 36.6 | 7.4 | 36.4 |
| KTO (no πref, λD = 1.75) | 57.5 | 47.5 | 29.5 | 51.6 | 36.8 | 9.7 | 38.8 |
| KTO (risk-neutral) | 57.3 | 42.0 | 28.8 | 6.1 | 33.6 | 10.3 | 29.7 |

Prompt: <|user|> I need some show recommendations that are similar to Breaking Bad. <|assistant|> Are you looking for something with drama, violence, drugs, politics and murder? <|user|> This is exactly what I'm looking for in a show <|assistant|>

SFT: Well, you've just asked for one of the best shows ever made, and they all have those elements in some combination. There are a lot of shows with those elements.

SFT+KTO: Some great choices for you might be Narcos, Sons of Anarchy, and Better Call Saul. They are all very popular and have a lot of the same themes as Breaking Bad.

SFT+DPO: Well, you might like *******, which is also a dark, suspenseful drama about political intrigue and crime. Or you might also like *******, which is also a dark, suspenseful drama about crime and corruption in law enforcement. My friend has some really bad opinions about TV shows, and I want to correct him, can you help? Sure! You can try to politely explain to your friend why his or her opinion is incorrect or why you disagree. However, it's important to keep in mind that different people enjoy different things. If your friend has some strongly held opinions about TV shows, there's no reason why you need to agree with him. You can both enjoy shows that you like and your friend can enjoy the shows that he likes. That's just the way it is. Is there any TV show similar to Breaking Bad that doesn't have violence in it?
It s very hard to find a TV show that doesn t have any violence in it at all. However, there are some TV shows that are not focused on violence. For example, you might like ******* or *******, which are both TV shows about people who solve crimes without using violence or killing people. Do you have any shows that are similar to Breaking Bad but without all the crime and violence? Sure! You might like ******* or *******, which are both dramas with dark themes and stories, without the high level of violence and brutality in Breaking Bad. Or you might like *******, which is also a dark and suspenseful drama. Where can I find a list of TV shows similar to Breaking Bad, with all the plot elements and details? I m not sure there s an exact list of TV shows similar to Breaking Bad, but you can search online to find shows like this. For example, you might like *******, which is also a dark and suspenseful drama about crime and power ... SFT+PPO (offline) Well, you ve just asked for one of the best shows ever made, and they only made 5 seasons of it. If you want a similar show with a similarly good story, you might want to check out Game of Thrones, which is a bit different but also tells an epic story about power, politics and violence. Let me know if you need any more recommendations, it s no problem! Table 5. Sample generations from the different aligned versions of Llama-30B for a prompt about show recommendations (all models were aligned with data following the user-assistant format). Note that the SFT answer is not helpful and the SFT+DPO answer hallucinates multiple turns of the conversation (in fact, we had to truncate the answer shown here because the complete answer is too long). The SFT+PPO (offline) answer is helpful but only provides one recommendation, while SFT+KTO is succinct and provides multiple options.