Published as a conference paper at ICLR 2025

L3MS: LAGRANGE LARGE LANGUAGE MODELS

Guneet S. Dhillon¹, Xingjian Shi², Yee Whye Teh¹, Alex Smola²
¹University of Oxford, ²Boson AI
{guneet.dhillon,y.w.teh}@stats.ox.ac.uk, {xingjian,smola}@boson.ai

ABSTRACT

Supervised fine-tuning (SFT) and alignment of large language models (LLMs) are key steps in providing a good user experience. However, the concept of an appropriate alignment is inherently application-dependent, and current methods often rely on heuristic choices to drive optimization. In this work, we formulate SFT and alignment as a constrained optimization problem: the LLM is fine-tuned on a task while being required to meet application-specific requirements, without resorting to heuristics. To solve this, we propose Lagrange Large Language Models (L3Ms), which employ logarithmic barriers to enforce the constraints. This approach allows for the customization of L3Ms across diverse applications while avoiding heuristic-driven processes. We experimentally demonstrate the versatility and efficacy of L3Ms in achieving tailored alignments for various applications.

1 INTRODUCTION

Large language models (LLMs) are used for a wide range of tasks: as chatbots (Brown et al., 2020; OpenAI, 2024), for code generation (Ahmad et al., 2021; Wang et al., 2021; Rozière et al., 2024), for medical assistance (Yang et al., 2022; Moor et al., 2023), and so on. The key ingredients for their impressive downstream performance are supervised fine-tuning (SFT) and alignment; the former fine-tunes the LLM to a task of interest, while the latter instills it with preferential properties.

Arguably, the right combination of preferential properties is highly application-dependent. For example, a scholar would want a chatbot to be honest and factual for assistance with their work. In contrast, a fiction writer might prefer the opposite behavior to help create fantastical imaginary worlds.
There is also plenty of (anecdotal) evidence in support: some LLMs refuse to provide information on how to kill a process in Unix, recommending the use of less violent strategies for dealing with wayward computer programs instead.¹ Therefore, we need frameworks for LLM customization. To this end, Bai et al. (2022); Rame et al. (2023); Wu et al. (2023); Ji et al. (2023); Zhou et al. (2024) fine-tune LLMs on varying combinations of such preferential properties.

In practice, one tends to resort to trial and error to find the right combination of preferences for their particular application. In doing so, one verifies whether certain minimum baselines are satisfied, such as ensuring the factual correctness of statements or confirming that response lengths are capped at 100 words. Since there isn't a way to enforce such requirements directly, current methods resort to heuristics. Additionally, existing pipelines carry out SFT and alignment sequentially and must ensure that the LLM does not forget relevant task information learned during the SFT stage. This is achieved by penalizing the LLM for drastic deviations, with the strength of the penalty determined heuristically.

In this work, we formulate SFT and alignment in LLMs as a constrained optimization problem. In particular, we fine-tune an LLM to minimize the task objective (the objective function) while simultaneously satisfying application-specific minimum requirements (the constraints). This merges the SFT and alignment stages and mitigates the reliance on heuristics altogether. Furthermore, we propose Lagrange Large Language Models, a.k.a. L3Ms, to solve such constrained optimization problems. Specifically, we do so by employing logarithmic barriers and gradually enforcing the constraints during the fine-tuning procedure. Lastly, we empirically demonstrate how one can pick and choose constraints and tailor L3Ms to a range of applications without resorting to heuristics.
*Work done while at Boson AI.
¹https://www.reddit.com/r/LocalLLaMA/comments/180p17f/new_claude_21_refuses_to_kill_a_python_process/

Prompt: What are the four main types of macromolecules found in living organisms?
Response (short-response L3M): 1. Nucleic acids (DNA and RNA) 2. Proteins 3. Carbohydrates 4. Lipids
Response (long-response L3M): 1. Nucleic acids, including DNA and RNA, which carry genetic information 2. Proteins, which are responsible for the structure, function, and regulation of living organisms 3. Lipids, which provide energy and form the membranes that surround cells 4. Carbohydrates, which are used as an energy source and as structural components of cells.

Prompt: Who discovered the law of universal gravitation?
Response (short-response L3M): Sir Isaac Newton is credited with discovering the law of universal gravitation in the late 17th century.
Response (long-response L3M): Sir Isaac Newton is widely credited with the discovery of the law of universal gravitation, which states that every particle in the universe attracts every other particle with a force proportional to their masses and inversely proportional to the square of the distance between them.

Table 1: Example responses from length-constrained L3Ms. We provide example responses from L3Ms with varying length constraints. We include the prompt along with the generated responses from two L3Ms: one constrained to produce short responses and the other constrained to produce long ones.

For example, Table 1 provides the generated responses from two such L3Ms; both are fine-tuned for instruction-following, but one is constrained to be concise, while the other is constrained to be verbose.

In summary, our contributions are as follows:
1. We formulate SFT and alignment in LLMs as a constrained optimization problem: an LLM is fine-tuned on a task while simultaneously satisfying custom requirements (cf. Section 4).
2.
We propose L3Ms, a family of LLMs fine-tuned using the above framework (cf. Section 5).
3. We experimentally demonstrate how L3Ms can be customized to different applications and their specific requirements while avoiding heuristic-driven processes (cf. Section 6).

2 OPTIMIZATION FOR LLMS

Training an LLM proceeds in multiple stages (Ouyang et al., 2022), which we discuss below.

2.1 PRE-TRAINING

The pre-training stage instills the LLM with a generic knowledge of language. It entails regenerating text/token sequences by minimizing their perplexity, i.e., the negative log-likelihood of the sequence normalized by its length. More formally, the perplexity on a sequence x is defined as:

l_θ(x) = −(1/|x|) log π_θ(x) = −(1/|x|) Σ_{i=1}^{|x|} log π_θ(x_i | x_{<i}),

where π_θ denotes the LLM's distribution over token sequences.

[...]

if the attained harmlessness reward already exceeds the threshold b_harmless, then the constraint is not active, and the LLM will not be penalized. In contrast, the previous approach would further penalize the LLM.

Notation. For ease of notation, we rewrite the constrained problem in Eqs. (3) and (4) as:

min_θ L(θ) subject to C_i(θ) ≤ 0 for all i ∈ {1, 2, ..., k},   (5)

with the objective L(θ) = E_{(x,y)∼p(·)}[l_θ(y|x)] and constraints C_i(θ) = E_{(x,·)∼p(·), y∼π_θ(·|x)}[b_i − r_i(y|x)].

4.1 TYPES OF CONSTRAINTS

While we write the constraints in Eq. (4) as expectation/average constraints, other forms exist. For instance, uniform constraints impose a minimum reward on every generated (prompt, response) pair:

r_i(y|x) ≥ b_i for all (x,·) ∼ p(·), y ∼ π_θ(·|x), and all i ∈ {1, 2, ..., k}.   (6)

Additionally, chance constraints bound the probability of the inequality holding away from zero:

P_{(x,·)∼p(·), y∼π_θ(·|x)}[r_i(y|x) ≥ b_i] ≥ 1 − ε_i for all i ∈ {1, 2, ..., k}.   (7)

These constraints are not equivalent, but they are related. We can rewrite Eq. (7) in the form of average constraints by using 1 − ε_i as the threshold and taking the expectation of the indicator 1{r_i(y|x) ≥ b_i}. Moreover, Eq. (6) implies Eq. (4), but the converse is not true. Unfortunately, Eq.
(6) is difficult to achieve in practice, especially when the data distribution is unknown. We continue using expectation/average constraints, but similar discussions extend to other types.

4.2 LAGRANGE MULTIPLIERS

We can introduce Lagrange multipliers λ_i ≥ 0 for the constraints and obtain the Lagrangian:

L(θ, λ) = L(θ) + Σ_{i=1}^{k} λ_i C_i(θ).   (8)

There is a rich literature connecting the Lagrangian with constrained optimization. Notably, the KKT conditions (Karush, 1939; Kuhn & Tucker, 1951) provide sufficient conditions for global optimality under convexity, where the solution is obtained by finding the saddle point of the Lagrangian. However, these conditions are not enough for highly non-convex scenarios such as ours.

Nevertheless, the Lagrangian is instructive in understanding the relative importance of the constraints. For an active constraint, i.e., one satisfied with equality, the corresponding Lagrange multiplier can be non-zero; the larger its value, the more important the constraint. Conversely, for an inactive constraint, i.e., one satisfied with strict inequality, the corresponding Lagrange multiplier must vanish to 0. This is known as complementary slackness and is one of the KKT conditions.

4.3 LOGARITHMIC BARRIER

A practical way to enforce constraints is with barrier functions. Consider the (relaxed) log barrier:

B_{μ,s}(z) = −μ log(−z) for z ≤ −s, and B_{μ,s}(z) = (μ/s) z + μ − μ log s for z > −s, and hence ∂B_{μ,s}(z)/∂z = μ / max(−z, s),   (9)

with parameters μ, s > 0. This is a convex, continuous, and differentiable function, which is valid for all z ∈ R. Importantly, for s = μ², this barrier function converges to the characteristic function χ{z ≤ 0} as μ → 0, i.e., it takes the value 0 when z ≤ 0 and ∞ otherwise (Tal et al., 1992; Nash et al., 1994; Hauser & Saccon, 2006; Feller & Ebenbauer, 2017); the condition s = μ² is sufficient, but not necessary (Kervadec et al., 2022). This convergence to the characteristic function is visually depicted in Fig.
1, showing the change in the log barrier function as we gradually decrease μ. We can now use the log barrier to enforce the constraints in Eq. (5) and simply add them to the objective. We obtain an unconstrained objective, with μ controlling the strength of the constraints:

G_μ(θ) = L(θ) + (1/k) Σ_{i=1}^{k} B_{μ,μ²}(C_i(θ)).   (10)

Figure 1: The relaxed logarithmic barrier. We depict the convergence of the relaxed logarithmic barrier B_{μ,μ²}(z) to the characteristic function χ{z ≤ 0} as μ → 0. We gradually decrease μ from 1 (blue) to 0.01 (red). Consequently, B_{μ,μ²}(z) gets closer to 0 for z ≤ 0 and increases to ∞ otherwise.

5 LAGRANGE LARGE LANGUAGE MODELS (L3MS)

Thus far, we have formulated the SFT and alignment stages as a constrained optimization problem in Eq. (5). We proceed to find solutions for the same by solving the unconstrained objective in Eq. (10). We call the family of models obtained in this way L3Ms, i.e., Lagrange Large Language Models.

5.1 OPTIMIZATION PROCEDURE

Since the log barrier converges to the characteristic function as μ → 0, we want to find the minimizer of G_μ(θ) for a very small μ. However, doing so directly leads to instabilities, as the objective function is ill-conditioned. Instead, it is common practice to follow an iterative procedure: one finds the minimizer for a fixed μ, reduces μ, and repeats (Curtis et al., 2024). Specifically, the procedure is instantiated with initial values θ_0, μ_0, and 0 < γ < 1. On the t-th iteration, μ_t ← γ μ_{t−1} is reduced and θ_t ← argmin_θ G_{μ_t}(θ) (with initialization θ_{t−1}). In doing so, the constraints are gradually enforced, nudging the LLM to satisfy them over the course of the optimization procedure while avoiding instabilities. As {μ_t} → 0, the weights {θ_t} converge to the minimizer of the constrained problem. It is impossible to minimize G_{μ_t}(θ) exactly in many practical applications. Instead, at each iteration, one can take a single optimization step toward the solution.
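This single-step-per-iteration schedule can be sketched on a toy one-dimensional problem. The sketch below is illustrative only: the objective, constraint, learning rate, decay factor γ, and floor on μ are all assumed choices, not the paper's settings.

```python
import math

def barrier_grad(z, mu, s):
    # Derivative of the relaxed log barrier of Eq. (9): mu / max(-z, s).
    return mu / max(-z, s)

# Toy problem: minimize L(theta) = theta^2 subject to
# C(theta) = 0.5 - theta <= 0, i.e., theta >= 0.5.
theta, mu = 2.0, 1.0
gamma, lr, mu_floor = 0.995, 0.01, 0.05  # assumed hyperparameters

for t in range(2000):
    mu = max(gamma * mu, mu_floor)   # reduce mu, gradually enforcing the constraint
    grad_objective = 2.0 * theta     # gradient of the task objective
    C, grad_C = 0.5 - theta, -1.0    # constraint value and its gradient
    # One gradient step on G_mu (cf. Eq. (11), here with k = 1):
    theta -= lr * (grad_objective + barrier_grad(C, mu, mu ** 2) * grad_C)

# theta settles slightly inside the feasible region, near the constrained
# optimum theta = 0.5; the gap shrinks further as mu is driven toward 0.
```

The run tracks the barrier-smoothed minimizer as μ shrinks, rather than solving each subproblem to completion, which is what makes the procedure compatible with stochastic gradients.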
Doing so is amenable to stochastic gradient methods and mitigates computational overhead: the optimization proceeds as normal while the value of μ is reduced over the course of the procedure. One can guarantee the convergence of this procedure to the optimal solution in some settings; for example, Curtis et al. (2024) prove convergence when dealing with box constraints. However, convergence in a scenario like ours is not guaranteed. Nevertheless, we will experimentally demonstrate its use for our constrained problems. We employ stochastic gradient methods and derive the gradient of our objective function directly:

∇_θ G_μ(θ) = ∇_θ L(θ) + (1/k) Σ_{i=1}^{k} μ ∇_θ C_i(θ) / max(−C_i(θ), μ²).   (11)

This follows immediately from Eqs. (9) and (10). Note that the gradients ∇_θ C_i(θ) are also known as policy gradients in the reinforcement learning literature. We discuss our strategy for estimating these gradients in Appendix B and refer readers to Schulman et al. (2016) for a more detailed review.

5.2 CONNECTION TO LAGRANGE MULTIPLIERS

The log barrier and the Lagrangian are intrinsically connected; this becomes evident when comparing Eq. (11) with the (gradient of the) Lagrangian in Eq. (8). In particular, we define the multipliers:

λ̂_i = μ / (k max(−C_i(θ), μ²)),

corresponding to the gradients ∇_θ C_i(θ) in Eq. (11). They can be interpreted as Lagrange multipliers: for active constraints, λ̂_i = 1/(kμ) is non-zero; for inactive constraints, λ̂_i = −μ/(k C_i(θ)) vanishes to 0 as μ → 0. Hence, the KKT complementary slackness condition is satisfied by design.

5.3 MEMORY AND TIME COMPLEXITY

L3Ms differ from traditional LLMs only in the fine-tuning process. In fact, L3Ms require less memory and are faster to fine-tune compared to traditional LLMs. This comes from merging the SFT and alignment stages as we did in our constrained optimization formulation in Eq. (5).
We minimize the task objective directly and avoid the need to compute deviations away from the SFT model (for instance, to impose the KL divergence penalty). As a result, we save on loading the SFT model into memory and evaluating it on generated responses at each optimization step. Also, the L3M's log barrier parameter μ is adjusted during fine-tuning itself, without needing any extra steps.

5.4 IMPLEMENTATION DETAILS

Alternating the objective and gradient clipping. Gradient clipping is a simple yet effective way to ensure stable training of large models (Goodfellow et al., 2016; Zhang et al., 2020). We employ this technique, albeit with the modification of clipping the gradients of the task objective and the constraints separately, as they can have varying magnitudes. We achieve this by alternating between reducing the task objective and enforcing the constraints, flipping a fair coin to select one or the other at each step. While this doubles the number of steps needed to achieve the same effect, it does not increase the amount of work done, as only one part of the objective or the other is evaluated at each step.

Length normalization. The gradient of our objective function in Eq. (11) involves the LLM's log-likelihoods on the generated responses through the gradients ∇_θ C_i(θ) (cf. Eq. (12)). To avoid a response-length bias, we length-normalize the log-likelihoods, akin to the definition of perplexity.

Estimating the mean preference rewards. We need to estimate the expectations involved in the gradient of our objective function in Eq. (11). The expectations in the numerators can be estimated with per-mini-batch Monte Carlo averages (Mohamed et al., 2020). However, the C_i(θ) in the denominator need careful consideration. Note that: (i) C_i(θ) does not involve the gradient, so its estimate can include information from previous mini-batches to reduce the estimate's variance, and (ii) since the weight θ is updated during fine-tuning, C_i(θ) is non-stationary.
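These two considerations call for an estimator that pools past mini-batches yet discounts stale ones as θ drifts. A minimal sketch of such a tracker follows; the decay rate is an assumed illustrative value, not the paper's setting.

```python
class RunningConstraintMean:
    """Tracks a mean (offset) preference reward C_i(theta) across mini-batches.

    Pooling information over batches reduces the estimate's variance (point (i)),
    while exponentially discounting old batches accommodates the non-stationarity
    of C_i(theta) as the weights are updated (point (ii)).
    """

    def __init__(self, decay=0.9):  # decay rate is an assumed hyperparameter
        self.decay = decay
        self.value = None

    def update(self, batch_mean):
        """Fold in the Monte Carlo average of the current mini-batch."""
        if self.value is None:
            self.value = batch_mean
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * batch_mean
        return self.value
```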
Hence, we use an exponential moving average estimate for the mean (offset) preference rewards C_i(θ) in the denominator.

6 EXPERIMENTAL RESULTS

To illustrate the customization of L3Ms, we empirically evaluate them on: (i) satisfaction of the imposed constraints, and (ii) minimization of the task objective. Our code, based on the Transformers library (Wolf et al., 2020), is available at: https://github.com/Guneet-Dhillon/l3m.

We use LLaMA-7B (Touvron et al., 2023) for all our experiments, as it is a lightweight LLM pre-trained on a large corpus. We are interested in the task of instruction-following, for which we use UltraChat (Ding et al., 2023), a large-scale dataset of instructional conversations. We run all experiments on NVIDIA H100s. Further details of our experimental setup are included in Appendix A.

We refer to the LLM fine-tuned to minimize the SFT objective only (without the alignment stage) as the SFT model. We fine-tune LLMs using the minimax approach to find a saddle point of the Lagrangian, as proposed by Moskovitz et al. (2024) and Dai et al. (2024), and refer to them as MMs. Lastly, we refer to the LLMs fine-tuned using our proposed approach as L3Ms. In what follows, we use different preference reward functions and vary the custom constraint requirements. All results are obtained on a held-out test dataset (not seen during training or validation).
| LLM type         | Length | Perplexity  |
|------------------|--------|-------------|
| SFT              | 121.6  | 0.805 ± 0.3 |
| L3M [50, 100]    |  81.3  | 0.804 ± 0.3 |
| L3M [100, 150]   | 120.7  | 0.804 ± 0.3 |
| L3M [50, 75]     |  64.4  | 0.807 ± 0.3 |
| L3M [75, 100]    |  88.2  | 0.808 ± 0.3 |
| L3M [100, 125]   | 111.7  | 0.810 ± 0.3 |
| L3M [125, 150]   | 126.5  | 0.809 ± 0.3 |
| L3M [75, 87.5]   |  82.9  | 0.811 ± 0.3 |
| L3M [87.5, 100]  |  92.7  | 0.809 ± 0.3 |
| L3M [100, 112.5] | 104.8  | 0.810 ± 0.3 |
| L3M [112.5, 125] | 117.3  | 0.810 ± 0.3 |

Figure 2: Length-constrained L3Ms. We report the response lengths (in tokens) and task perplexities of the SFT model and the L3Ms with varying length constraints. Left: the mean response length with the mean and standard deviation of the task perplexities. Right: the distribution of the response lengths. The notches indicate the medians and their 95% confidence intervals, the boxes show the 25% and 75% quantiles, and the whiskers denote the 1.5 interquartile ranges. The white circles mark the means, and the black dashed lines depict the constraints imposed on the different L3Ms.

6.2 LENGTH CONSTRAINED L3MS

Consider tasks in which the lengths of the responses need to be contained in the range [l_low, l_high] to control verbosity; for example, in summarization tasks (Makino et al., 2019). In this case, the natural choices for the reward functions compute the response length and its negation: r1(y|x) = |y| and r2(y|x) = −|y|. Furthermore, these rewards are to be controlled with the minimum requirements of l_low and −l_high, respectively. Note that these reward functions are perfectly negatively correlated. If we naively average the rewards, any unconstrained formulation of alignment (including RLHF) will be ineffective, as the loss will always vanish due to the anti-correlation.
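The failure of naive averaging is mechanical: the two length rewards cancel exactly, while the pair of per-threshold constraint values remains informative. A small illustrative sketch (token counts stand in for real tokenized responses; the function name is ours):

```python
def length_rewards(n_tokens, l_low, l_high):
    """Contrast a naively averaged length reward with per-threshold constraints."""
    r1 = n_tokens   # reward encouraging longer responses
    r2 = -n_tokens  # reward encouraging shorter responses
    naive_average = 0.5 * (r1 + r2)  # identically 0: carries no learning signal
    # Offset constraint values (each must be >= 0 on average, cf. Eq. (5)):
    c1 = n_tokens - l_low   # negative when the response is too short
    c2 = l_high - n_tokens  # negative when the response is too long
    return naive_average, c1, c2

# A 120-token response against the target range [50, 100]: the averaged
# reward is 0 either way, but c2 = -20 flags the violated upper bound.
```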
We could use a weighted average and tune the weights heuristically, but this is tedious. Instead, we use the constrained formulation and directly constrain the offset rewards r1(y|x) = |y| − l_low and r2(y|x) = l_high − |y|.

We fine-tune several L3Ms with varying length constraints. We illustrate the distributions of the generated response lengths (in tokens) and report the perplexities achieved on the task-related data in Fig. 2. We observe that the mean response lengths are in the required range in each case, satisfying the imposed average constraints. Additionally, the task perplexities increase slightly as the constraints on the response lengths are made more stringent. However, there is little to no degradation relative to the SFT model, with all mean task perplexities being within 0.02 standard deviations. The examples included in Table 1 are generated from such L3Ms. While all the responses correctly answer the prompts, their lengths vary, corresponding to the length constraints imposed on them.

6.3 HELPFUL AND HARMLESS L3MS

Next, we consider the Helpful and Harmless (HH) preferences that have been extensively used in the LLM alignment literature (Ji et al., 2023; Wang et al., 2024; Zhou et al., 2024; Guo et al., 2024). Specifically, we utilize the datasets by Bai et al. (2022) to train the two preference reward functions, respectively. These learned reward functions are negatively correlated (Bai et al., 2022; Dai et al., 2024). Furthermore, note that the numerical outputs of both reward functions are interpreted as ratings: a higher value indicates higher helpfulness/harmlessness, and vice versa.

We fine-tune several L3Ms with varying HH constraints. We compare our L3M approach of using log barriers with that of the minimax optimization by Moskovitz et al. (2024) and Dai et al. (2024) to find a saddle point of the Lagrangian (MMs).
In our experience, learning the Lagrange multipliers with the latter is extremely sensitive to the choice of the learning rate. Moreover, to avoid loading an additional SFT model during fine-tuning, as is done by Moskovitz et al. (2024) and Dai et al. (2024), our implementation of MM minimizes the task objective directly (as is done by L3Ms as well).

| Constraints (Helpful, Harmless) | MM perplexity | L3M perplexity |
|---------------------------------|---------------|----------------|
| (4, 0)                          |               | 0.804 ± 0.3    |
| (4, 1)                          |               | 0.809 ± 0.3    |
| (4, 2)                          | 0.822 ± 0.3   | 0.818 ± 0.3    |
| (3, 3)                          | 0.822 ± 0.3   | 0.814 ± 0.3    |
| (2, 4)                          | 0.825 ± 0.3   | 0.820 ± 0.3    |
| (1, 4)                          |               | 0.817 ± 0.3    |
| (0, 4)                          |               | 0.816 ± 0.3    |

Figure 3: Helpful and harmless L3Ms. We report the helpful-harmless rewards and task perplexities achieved by the different LLMs. Left: the helpful-harmless rewards attained by the LLM at initialization (at the bottom-left in blue), the SFT model (at the top-left in orange), the MMs (in green), and the L3Ms (in red). We depict the imposed constraints in black, with the dotted gray lines connecting LLMs to their corresponding constraints. Note that constraints are satisfied if the obtained reward point is at the top-right of its corresponding constraint point. For example, the shaded region denotes the feasible region for the constraint point (3, 3), with the shade gradient denoting the distance from the constraint boundary (light to dark shows an increase in distance). Right: the mean and standard deviation of the task perplexities for MMs and L3Ms, along with their corresponding constraints; the task perplexity at initialization is 1.316 ± 0.4 and that of the SFT model is 0.805 ± 0.3.

Fig. 3 shows the achieved task perplexities and helpful-harmless rewards for the different LLMs. At initialization, the helpful-harmless rewards are both low, with a high task perplexity of 1.316 ± 0.4. This is improved upon by the SFT model, reducing the perplexity to 0.805 ± 0.3 and attaining a high helpfulness reward (due to the task data instilling instruction-following capabilities).
Furthermore, MMs sacrifice task performance to over-satisfy the constraints, with mean task perplexities of at least 0.822. Conversely, L3Ms satisfy the imposed helpful-harmless reward constraints with consistently lower task perplexities (the mean perplexities are in the range 0.804–0.820). We attribute this to the L3Ms having better Lagrange multipliers by design, rather than learning them as in MMs (cf. Section 5.2). Note that here we evaluate the LLMs on reward constraint satisfaction and task objective minimization. Furthermore, one can increase the constraint minimum baselines to obtain higher rewards.

7 CONCLUSIONS

In this work, we formulate SFT and alignment in LLMs as constrained optimization: we minimize the task objective while simultaneously imposing application-specific constraints on preferences. This enables the customization of LLMs to different preferential properties while maintaining performance on the task of interest. Consequently, we propose Lagrange Large Language Models (L3Ms) to solve this constrained optimization problem by incorporating the constraints in the objective using the logarithmic barrier. We include experimental results to illustrate the customization qualities of L3Ms, which can fit to different preferences, providing a personalized user experience.

REFERENCES

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2668, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.211. URL https://aclanthology.org/2021.naacl-main.211.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022. URL https://arxiv.org/abs/2204.05862.

Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, and Christopher Summerfield. Fine-tuning language models to find agreement among humans with diverse preferences. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 38176–38189. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/f978c8f3b5f399cae464e85f72e28503-Paper-Conference.pdf.

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39(3-4):324–345, 12 1952. ISSN 0006-3444. doi: 10.1093/biomet/39.3-4.324. URL https://doi.org/10.1093/biomet/39.3-4.324.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R.
Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=bx24KpJ4Eb. Survey Certification.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.

Frank E. Curtis, Vyacheslav Kungurtsev, Daniel P. Robinson, and Qi Wang. A stochastic-gradient-based interior-point algorithm for solving smooth bound-constrained optimization problems, 2024. URL https://arxiv.org/abs/2304.14907.

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3029–3051, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.183. URL https://aclanthology.org/2023.emnlp-main.183.

Christian Feller and Christian Ebenbauer. Relaxed logarithmic barrier function based model predictive control of linear systems. IEEE Transactions on Automatic Control, 62(3):1223–1238, 2017. doi: 10.1109/TAC.2016.2582040.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 10835–10866. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/gao23h.html.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L Isbell, and Andrea L Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/e034fb6b66aacc1d48f445ddfb08da98-Paper.pdf.

Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, and Maosong Sun.
Controllable preference optimization: Toward controllable multi-objective alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1437–1454, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.85. URL https://aclanthology.org/2024.emnlp-main.85/.

John Hauser and Alessandro Saccon. A barrier function method for the optimization of trajectory functionals with constraints. In Proceedings of the 45th IEEE Conference on Decision and Control, pp. 864–869, 2006.

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 24678–24704. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf.

William Karush. Minima of functions of several variables with inequalities as side conditions. Master's thesis, Department of Mathematics, University of Chicago, Illinois, USA, 1939.

Hoel Kervadec, Jose Dolz, Jing Yuan, Christian Desrosiers, Eric Granger, and Ismail Ben Ayed. Constrained deep networks: Lagrangian optimization via log-barrier extensions. In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 962–966, 2022. doi: 10.23919/EUSIPCO55093.2022.9909927.

W. Bradley Knox and Peter Stone. TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE International Conference on Development and Learning, pp. 292–297, 2008. doi: 10.1109/DEVLRN.2008.4640845.

Harold W. Kuhn and Albert W. Tucker. Nonlinear programming. In J.
Neyman (ed.), Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 481–492, Berkeley, California, USA, 1951. University of California Press.

Kaiwen Li, Tao Zhang, and Rui Wang. Deep reinforcement learning for multiobjective optimization. IEEE Transactions on Cybernetics, 51(6):3103–3114, 2021. doi: 10.1109/TCYB.2020.2977661.

Takuya Makino, Tomoya Iwakura, Hiroya Takamura, and Manabu Okumura. Global optimization under length constraint for neural text summarization. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1039–1048, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1099. URL https://aclanthology.org/P19-1099.

Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020. URL http://jmlr.org/papers/v21/19-346.html.

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M. Krumholz, Jure Leskovec, Eric J. Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, Apr 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-05881-4. URL https://doi.org/10.1038/s41586-023-05881-4.

Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca Dragan, and Stephen Marcus McAleer. Confronting reward model overoptimization with constrained RLHF. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gkfUvn0fLU.

Stephen G. Nash, R. Polyak, and Ariela Sofer. A Numerical Comparison of Barrier and Modified Barrier Methods For Large-Scale Bound-Constrained Optimization, pp. 319–338. Springer US, Boston, MA, 1994. ISBN 978-1-4613-3632-7. doi: 10.1007/978-1-4613-3632-7_16.
URL https://doi.org/10.1007/978-1-4613-3632-7_16.

OpenAI. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 53728–53741. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: Towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 71095–71134. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/e12a3b98b67e8395f639fde4c2b03168-Paper-Conference.pdf.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code Llama: Open foundation models for code, 2024. URL https://arxiv.org/abs/2308.12950.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

A Ben Tal, M Tsibulevskii, and I Yusefovich. Modified barrier methods for constrained and minimax problems. Technical report, Optimization Laboratory, Israel Institute of Technology, 1992.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971.

Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL https://aclanthology.org/2021.emnlp-main.685.
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alexander Nicholas D'Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 51161–51176. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/wang24ay.html.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.

Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 59008–59033. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/b8c90b65739ae8417e61eadb521f63d5-Paper-Conference.pdf.

Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E. Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B. Costa, Mona G. Flores, Ying Zhang, Tanja Magoc, Christopher A.
Harle, Gloria Lipori, Duane A. Mitchell, William R. Hogan, Elizabeth A. Shenkman, Jiang Bian, and Yonghui Wu. A large language model for electronic health records. npj Digital Medicine, 5(1):194, Dec 2022. ISSN 2398-6352. doi: 10.1038/s41746-022-00742-2. URL https://doi.org/10.1038/s41746-022-00742-2.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgnXpVYwS.

Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics ACL 2024, pp. 10586–10613, Bangkok, Thailand and virtual meeting, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.630. URL https://aclanthology.org/2024.findings-acl.630.

A EXPERIMENTAL SETUP

In addition to the experimental setup discussed in Section 6.1, we provide further details here.

Task data We use UltraChat (Ding et al., 2023), a large-scale dataset of instructional conversations, as our task data to induce instruction-following capabilities. Since each sample contains a sequence of multi-turn question-answer pairs, we randomly sample one of the answers as the response and treat the preceding dialogue as the prompt. We then filter such (prompt, response) pairs to a maximum token length of 512. Consequently, we obtain 340k training samples, 1.7k validation samples, and 1.7k test samples, split randomly since the dataset does not provide train-val-test splits.

Hyper-parameters We fine-tune LLMs for 1 epoch on the task data, with a mini-batch size of 64.
We use Adam with a learning rate of 10⁻⁶ and a cosine learning rate scheduler (with 5% of the epoch used for warmup). We set weight decay to 0.1 and the gradient clipping maximum norm to 1. We utilize 16-bit (mixed) precision training and gradient checkpointing. We exponentially decay the log-barrier parameter µ during fine-tuning from 1 to 10⁻⁶ and use a smoothing factor of 0.1 for the exponential moving average. Lastly, we use top-p sampling (p set to 0.9) for response generation. Apart from this, we use the default hyper-parameters in the Transformers library (Wolf et al., 2020).

A.1 LEARNING PREFERENCE REWARD MODELS

While some preference reward functions are engineered or rule-based, others are learned, since such preferences are often difficult to quantify directly. It is instead easier to compare responses with respect to the preference, e.g., ranking them from most to least helpful. Consequently, the data for learning preference reward models consist of tuples of the form (x, y⁺, y⁻), where the prompt x is accompanied by two responses y⁺ and y⁻, with a preference for the former over the latter. The preference reward model is denoted by r_φ(·) (parameterized by φ). Assuming the Bradley-Terry model (Bradley & Terry, 1952), the model's predicted probability of preferring y⁺ over y⁻ is

p_{r_φ}(y⁺ ≻ y⁻ | x) = σ(r_φ(y⁺|x) − r_φ(y⁻|x)),

with σ(·) the standard logistic function. The model is then trained to minimize the negative log-likelihood:

min_φ E_{(x,y⁺,y⁻)∼t(·)} [−log p_{r_φ}(y⁺ ≻ y⁻ | x)].

Taking inspiration from Rafailov et al. (2023), we initialize the preference reward model r_φ(·) as a pre-trained LLM and set the reward to be its length-normalized log-likelihood. In this way, we utilize the pre-trained model fully, not just its backbone. As the preference reward model is fine-tuned, its log-likelihoods/rewards are updated to differentiate the preferred responses from the rejected ones.
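As a minimal sketch of the Bradley-Terry objective above, assuming scalar rewards (the illustrative numbers below stand in for the reward model's length-normalized log-likelihoods):

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_nll(r_pos, r_neg):
    """Negative log-likelihood of preferring y+ over y- under the
    Bradley-Terry model, given scalar rewards r(y+|x) and r(y-|x)."""
    return -np.log(sigmoid(r_pos - r_neg))

# Minimizing this loss pushes the reward of the preferred response above
# that of the rejected one (rewards here are hypothetical numbers).
loss_good = bradley_terry_nll(r_pos=2.0, r_neg=-1.0)  # preferred scored higher
loss_bad = bradley_terry_nll(r_pos=-1.0, r_neg=2.0)   # preferred scored lower
```

In practice, the loss would be averaged over mini-batches of (x, y⁺, y⁻) tuples and minimized with respect to the reward model parameters φ.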
Helpful and harmless data We use the Helpful and Harmless (Bai et al., 2022) preference data to learn two reward models, respectively. We obtain 161k training samples and 9k test samples after filtering the (prompt, response) pairs to a maximum token length of 2024; 3/4 of the samples are for helpfulness and the remaining 1/4 are for harmlessness. We further set aside 5% of the training data for validation.

Hyper-parameters We initialize all reward models with LLaMA-7B (Touvron et al., 2023). We fine-tune for 2 epochs with a mini-batch size of 64. We use Adam with a learning rate of 10⁻⁶ and a cosine learning rate scheduler (with 10% of the epoch used for warmup). We set weight decay to 0.1 and the gradient clipping maximum norm to 1. We utilize 16-bit (mixed) precision training and gradient checkpointing. Apart from this, we use the default hyper-parameters in the Transformers library (Wolf et al., 2020). We validate after every 10% of the epoch and save the best model.

B POLICY GRADIENT

We are interested in the policy gradients ∇_θ C_i(θ). Note that when computing the gradient of an expectation with respect to the parameters of the underlying distribution, we can use the log-derivative trick:

∇_θ E_{x∼p_θ(·)}[f(x)] = ∫ dx f(x) ∇_θ p_θ(x)
                      = ∫ dx f(x) (p_θ(x)/p_θ(x)) ∇_θ p_θ(x)
                      = ∫ dx f(x) p_θ(x) ∇_θ log p_θ(x)
                      = E_{x∼p_θ(·)}[f(x) ∇_θ log p_θ(x)].

Applying the above to the policy gradient ∇_θ C_i(θ) yields

∇_θ C_i(θ) = ∇_θ E_{(x,·)∼p(·), y∼π_θ(·|x)}[c_i(y|x)] = E_{(x,·)∼p(·), y∼π_θ(·|x)}[c_i(y|x) ∇_θ log π_θ(y|x)],   (12)

where c_i(y|x) = b_i − r_i(y|x). This is the simplest form of the policy gradient and can be estimated with Monte Carlo averages. We refer readers to Schulman et al. (2016) for a review of other estimators.
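For concreteness, the score-function estimator above can be sketched on a toy categorical policy, where the exact gradient is available in closed form for comparison (the distribution and cost function below are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score_function_gradient(logits, f, n_samples=200_000):
    """Monte Carlo estimate of grad_theta E_{x ~ p_theta}[f(x)] via the
    log-derivative trick, for a categorical p_theta = softmax(logits).
    For the softmax, grad_theta log p_theta(x) = one_hot(x) - p."""
    p = softmax(logits)
    xs = rng.choice(len(p), size=n_samples, p=p)
    grad_log_p = np.eye(len(p))[xs] - p          # per-sample score vectors
    return (f(xs)[:, None] * grad_log_p).mean(axis=0)

logits = np.array([0.2, -0.5, 1.0])
f = lambda xs: np.asarray(xs, dtype=float) ** 2  # toy per-sample cost c(x)

# Exact gradient: sum_x f(x) p(x) (one_hot(x) - p), for comparison.
p = softmax(logits)
exact = ((np.arange(3) ** 2)[:, None] * p[:, None] * (np.eye(3) - p)).sum(axis=0)
estimate = score_function_gradient(logits, f)
```

The Monte Carlo estimate converges to the exact gradient as the number of samples grows; in the paper's setting, x plays the role of a sampled response y, p_θ the role of the policy π_θ(·|x), and f the role of the constraint cost c_i(y|x).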