# Token-level Direct Preference Optimization

Yongcheng Zeng 1 2, Guoqing Liu 3, Weiyu Ma 1 2, Ning Yang 1, Haifeng Zhang 1, Jun Wang 4

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, these responses are generated at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing the policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates a forward KL divergence constraint for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence while preserving simplicity, without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Tokenlevel-Direct-Preference-Optimization.

1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Microsoft Research AI4Science; 4 University College London. Correspondence to: Jun Wang, Haifeng Zhang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

Large language models (LLMs) (Achiam et al., 2023; Bubeck et al., 2023) have demonstrated significant generalization capabilities in various domains, including text summarization (Stiennon et al., 2022; Koh et al., 2022), code writing (Chen et al., 2021; Gao et al., 2023), and even following human instructions (Chung et al., 2022; Ouyang et al., 2022). To align LLMs with human intentions, Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022; Dong et al., 2023; Yuan et al., 2023; Liu et al., 2023) has emerged as a highly effective method, embodying both stylistic and ethical values (Bai et al., 2022; Ganguli et al., 2022). These approaches typically involve training a reward model and then fine-tuning the policy model with reinforcement learning (RL).

Direct Preference Optimization (DPO) (Rafailov et al., 2023) introduces a straightforward and effective technique for training LLMs with pairwise comparisons, without the need to explicitly establish a reward model. DPO uses KL divergence to keep the training process closely aligned with a reference Large Language Model (LLM), preventing significant deviations. In DPO, the KL divergence is assessed at the sentence level, reflecting the fact that evaluations are based on complete responses (answers), typically comprising several sentences. However, these responses are generated sequentially, in an auto-regressive fashion.
A potential benefit is to examine divergence from the reference LLM on a more granular, token-by-token basis. One approach involves using the sequential KL divergence (as defined in Definition 4.3), which monitors the trajectory of the generated responses. As illustrated in Figure 1, DPO exhibits a significantly faster increase in KL divergence on the subset of dispreferred responses than on the preferred subset. This results in an expanding gap between the two subsets and also indicates that DPO does not effectively control the KL divergence of the dispreferred response subset. This harms the model's divergence efficiency and ultimately affects its linguistic capabilities and generative diversity. Such a limitation highlights the decreased effectiveness of employing KL divergence within the DPO framework, suggesting an area for improvement in its methodology.

[Figure 1: three panels plotting $D_{\mathrm{SeqKL}}(x, y_w; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$, $D_{\mathrm{SeqKL}}(x, y_l; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$, and their margin against training step.]

Figure 1. Sequential KL (SeqKL) divergence of the preferred and dispreferred responses on the IMDb dataset. Figure 1(a) shows the progression of SeqKL divergence on the preferred responses over training steps. Figure 1(b) depicts the evolution of SeqKL divergence on the dispreferred responses over training steps. Figure 1(c) illustrates the difference between the SeqKL divergence of the dispreferred responses and that of the preferred responses during training, namely $\mathrm{margin} = |D_{\mathrm{SeqKL}}(x, y_w; \pi_{\mathrm{ref}} \,\|\, \pi_\theta) - D_{\mathrm{SeqKL}}(x, y_l; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)|$. The definition of SeqKL divergence is given in Definition 4.3.

The imbalance in the growth rates of the sequential KL divergence is potentially related to the reverse KL divergence constraint employed by DPO. The mode-seeking property of reverse KL divergence tends to reduce diversity during generation, limiting the model's potential to produce diverse and effective responses (Wiher et al., 2022; Khalifa et al., 2020; Glaese et al., 2022; Perez et al., 2022). Built upon DPO, the f-DPO method (Wang et al., 2023) studies the trade-off between alignment performance and generation diversity of LLMs under different divergence constraints. It highlights the advantages of the mass-covering behavior of forward KL divergence in enhancing model diversity and explores the impact of different divergence constraints. Nevertheless, f-DPO only discusses the changes in model behavior under either the reverse KL divergence or the forward KL divergence constraint in isolation. Essentially, it does not fundamentally enhance the DPO algorithm itself but rather trades off alignment performance against generation diversity by simply swapping in different KL divergence constraints.

Inspired by the aforementioned observations, we define and examine the problem of aligning with human preferences from a sequential, token-level standpoint. Some concurrent work has also been conducted in this direction (Rafailov et al., 2024; Zhong et al., 2024). We introduce a new method, referred to as Token-level Direct Preference Optimization (TDPO), which aims to strike a better balance between alignment performance and generation diversity by controlling the KL divergence of each token. To achieve this, we redefine the objective of maximizing restricted rewards in a sequential manner; a sketch of the sequential KL divergence that this analysis relies on is given below.
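As a concrete illustration, a minimal PyTorch-style sketch of the sequential KL divergence $D_{\mathrm{SeqKL}}(x, y; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$ tracked in Figure 1 might look as follows. It assumes Definition 4.3 takes the natural form of a sum, over the token positions of the response, of the per-prefix KL divergence between the reference and policy next-token distributions, with the divergence direction taken from the Figure 1 labels; the function and argument names are illustrative and are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def sequential_kl(policy_logits: torch.Tensor,
                  ref_logits: torch.Tensor,
                  response_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of D_SeqKL(x, y; pi_ref || pi_theta).

    Sums, over the token positions of the response y, the KL divergence
    between the reference model's and the policy's next-token distributions
    conditioned on the same prefix [x, y^{<t}].

    policy_logits, ref_logits: (batch, seq_len, vocab) logits from a forward
        pass of each model over the concatenated prompt and response.
    response_mask: (batch, seq_len), 1 at positions whose next-token
        distribution conditions on a response prefix, 0 on prompt/padding.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-position KL(pi_ref || pi_theta) = sum_z pi_ref(z) (log pi_ref(z) - log pi_theta(z)).
    per_token_kl = (ref_logp.exp() * (ref_logp - policy_logp)).sum(dim=-1)
    # Accumulate along the response trajectory only.
    return (per_token_kl * response_mask).sum(dim=-1)
```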
The connection between sentence-level reward and token-level generation is established through the Bellman equation. The Bradley-Terry model (Bradley & Terry, 1952) is then converted into a token-level representation, demonstrating its close relationship with the Regret Preference Model (Knox et al., 2022; 2023). Through this construction, we effectively integrate a forward KL divergence restriction for each token into the final objective function, resulting in improved regulation of KL divergence. TDPO maintains the simplicity of DPO while offering improved regulation of KL divergence for aligning LLMs with human preferences. Echoing the strategy of DPO, our method directly optimizes the policy without requiring explicit reward model learning or policy sampling during the training phase. Our experimental results demonstrate the effectiveness of TDPO across multiple text tasks, with a notable enhancement in the quality of generated responses compared to both DPO and PPO-based RLHF methods. In conclusion, TDPO stands out for its ability not only to effectively address the issue of excessive KL divergence but also to greatly improve divergence efficiency.

## 2. Related Works

The emergence of ChatGPT has catalyzed significant advancements in the field of Large Language Models (LLMs), such as OpenAI's GPT-4 (Achiam et al., 2023), Mistral (Jiang et al., 2023), and Google's Gemini (Team et al., 2023). Generally, the training of LLMs involves three stages: initial unsupervised pre-training on massive text corpora to grasp linguistic structures (Raffel et al., 2020; Brown et al., 2020; Workshop et al., 2022; Touvron et al., 2023), followed by supervised fine-tuning on task-specific datasets to raise the LLM's probability of producing desired responses (Taori et al., 2023; Chiang et al., 2023; Vu et al., 2023). However, because labeled datasets are typically limited and expensive to obtain during the supervised fine-tuning stage, the model may retain biases and inaccuracies, manifesting as societal biases (Sheng et al., 2021), ethical concerns (Weidinger et al., 2021), toxicity (Rauh et al., 2022), and hallucinations (Huang et al., 2023), which necessitates a subsequent AI alignment phase. Noteworthy models achieving significant alignment, such as Zephyr (Tunstall et al., 2023) and GPT-4 (Achiam et al., 2023), have demonstrated the effectiveness of techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).

Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone in aligning LLMs with human values, providing a mechanism to refine model outputs based on qualitative feedback (Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022; Song et al., 2023; Touvron et al., 2023). This approach has shown considerable promise in making models more responsive to human expectations and ethical considerations by iteratively improving their performance through human-generated feedback. However, the complexity of implementing RLHF, compounded by inaccuracies in human-generated reward models (Wu et al., 2023), has prompted the exploration of alternative strategies. Methods like Reward Ranked Fine Tuning (RAFT) (Dong et al., 2023) and Rank Responses to align Human Feedback (RRHF) (Yuan et al., 2023) offer streamlined approaches to alignment, circumventing some of RLHF's inherent challenges.
Particularly, Direct Preference Optimization (DPO) (Rafailov et al., 2023) represents a breakthrough in direct policy optimization, addressing the intricacies of balancing model behavior through a nuanced approach to reward function optimization. Nevertheless, the challenge of maintaining linguistic diversity while aligning with human preferences remains a pivotal concern, prompting our proposed Token-level Direct Preference Optimization (TDPO), which seeks to harmonize the dual objectives of alignment accuracy and expressive range in model outputs.

## 3. Preliminaries

For language generation, a language model (LM) is prompted with a prompt (question) $x$ to generate a response (answer) $y$, where both $x$ and $y$ consist of a sequence of tokens. Direct Preference Optimization (DPO) (Rafailov et al., 2023) starts from the RL objective of RLHF:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta D_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big], \tag{1}$$

where $\mathcal{D}$ represents the human preference dataset, $r(x, y)$ denotes the reward function, $\pi_{\mathrm{ref}}(\cdot \mid x)$ serves as a reference model, typically chosen as the language model after supervised fine-tuning, $\pi_\theta$ represents the model undergoing RL fine-tuning, initialized as $\pi_\theta = \pi_{\mathrm{ref}}$, and $\beta$ is the coefficient of the reverse KL divergence penalty. By deriving directly from Eq. 1, DPO establishes a mapping between the reward model and the optimal policy under the reverse KL divergence, obtaining a representation of the reward function in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x). \tag{2}$$

Here, $Z(x)$ is the partition function. To align with human preferences, DPO uses the Bradley-Terry model for pairwise comparisons:

$$P_{\mathrm{BT}}(y_1 \succ y_2 \mid x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))}. \tag{3}$$

By substituting Eq. 2 into Eq. 3 and leveraging the negative log-likelihood loss, DPO derives its objective function:

$$u(x, y_w, y_l) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}, \qquad \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(u(x, y_w, y_l)\big)\big], \tag{4}$$

and its derivative is given as follows:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\sigma(-u)\, \nabla_\theta u\big], \tag{5}$$

where $u$ is the abbreviation of $u(x, y_w, y_l)$, and $y_w$ and $y_l$ denote the preferred and dispreferred completions.

## 4. Methodology

In this section, we first reformulate the constrained reward maximization problem in a token-level form. From this, we derive the mapping between the state-action function and the optimal policy. Subsequently, we convert the Bradley-Terry model into a token-level representation, establishing its equivalence with the Regret Preference Model. By substituting the mapping relationship into the token-level reward model, we obtain an optimization objective expressed solely in terms of the policy. Finally, we conduct a formal analysis of this optimization objective in terms of its derivatives and, based on this, derive the ultimate loss function for TDPO.

### 4.1. Markov Decision Process under Token Rewards

To model sequential, auto-regressive generation, we extend the sentence-level formulation in Section 3 by considering that the response consists of $T$ tokens, $y = [y^1, y^2, \ldots, y^T]$.

[…]

When $\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} > \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}$, the value of $\sigma(-u)$ becomes larger, applying a stronger update for the comparison $(y_w, y_l)$. The second part, $\delta$, is a distinctive component of our method. As shown in Figure 1, the KL divergence of the dispreferred response subset grows faster than that of the preferred response subset. As this disparity increases, the corresponding value of $\delta$ rises, thereby amplifying the weight factor $\sigma(-u + \delta)$.
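To make the interplay between $u$ and $\delta$ concrete, the following is a minimal PyTorch-style sketch of a pairwise loss that has exactly the gradient structure discussed here and in the next paragraph: a $-\log \sigma(u - \delta)$ objective whose gradient carries the weight $\sigma(-u + \delta)$ and splits into $\nabla_\theta u$ and $(-\nabla_\theta \delta)$ terms, with $\delta$ built from the difference of sequential forward KL divergences. Since Eq. 16 itself is not reproduced in this excerpt, the function name, the placement of $\beta$, and the absence of any extra balancing coefficient are assumptions for illustration; `sequential_kl` refers to the earlier sketch.

```python
import torch
import torch.nn.functional as F

def tdpo_style_loss(policy_logps_w, ref_logps_w,   # log pi(y_w | x), shape (batch,)
                    policy_logps_l, ref_logps_l,   # log pi(y_l | x), shape (batch,)
                    seqkl_w, seqkl_l,              # D_SeqKL(x, y; pi_ref || pi_theta), shape (batch,)
                    beta: float = 0.1):
    """Illustrative token-level-style preference loss (not the released Eq. 16).

    u     = beta * (log-ratio of y_w) - beta * (log-ratio of y_l)   # as in DPO, Eq. 4
    delta = beta * (SeqKL on y_l - SeqKL on y_w)                    # sequential forward KL gap
    loss  = -log sigma(u - delta), whose gradient carries the
            sigma(-u + delta) weight discussed above.
    """
    u = beta * (policy_logps_w - ref_logps_w) - beta * (policy_logps_l - ref_logps_l)
    delta = beta * (seqkl_l - seqkl_w)
    return -F.logsigmoid(u - delta).mean()
```

In this sketch, `seqkl_w` and `seqkl_l` would be obtained by evaluating the earlier `sequential_kl` sketch on the preferred and dispreferred responses of each pair, while `policy_logps_*` and `ref_logps_*` are the summed per-token log-probabilities of the corresponding responses.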
Combined with the subsequent gradient term, this weight factor allows our objective function to effectively suppress the difference in KL divergence between pairs of responses with large disparities in KL divergence. Through the collaborative influence of the weight factor $\sigma(-u + \delta)$ and the gradient term $(-\nabla_\theta \delta)$, our method achieves automatic control over the balance of KL divergence.

The gradient of the loss function in Eq. 16 also consists of two components, $\nabla_\theta u$ and $(-\nabla_\theta \delta)$. $\nabla_\theta u$ represents the optimization direction of the gradient in DPO; intuitively, it increases the likelihood of the preferred completions $y_w$ and decreases the likelihood of the dispreferred completions $y_l$. In contrast, $(-\nabla_\theta \delta)$ tends to narrow the gap between $D_{\mathrm{SeqKL}}(x, y_w; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$ and $D_{\mathrm{SeqKL}}(x, y_l; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$. However, when considered separately, the gradient of $D_{\mathrm{SeqKL}}(x, y_w; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)$ in the loss function tends to increase the sequential KL divergence between $\pi_{\mathrm{ref}}$ and $\pi_\theta$ at $(x, y_w)$ during optimization. This is because the sequential forward KL divergence in the loss function is introduced through the state-value function $V_\pi$, inherently introducing an expectation $\mathbb{E}_{z \sim \pi_{\mathrm{ref}}}[\log \pi_\theta(z \mid [x, y$