Published as a conference paper at ICLR 2025

MULTI-OBJECTIVE ANTIBODY DESIGN WITH CONSTRAINED PREFERENCE OPTIMIZATION

Milong Ren (3,4), Zaikai He (1,3,4), Haicang Zhang (1,2)
1 Medicinal Bioinformatics Center, Shanghai Jiao Tong University School of Medicine
2 Central China Research Institute of Artificial Intelligence
3 Institute of Computing Technology, Chinese Academy of Sciences
4 University of Chinese Academy of Sciences

ABSTRACT

Antibody design is crucial for developing therapies against diseases such as cancer and viral infections. Recent deep generative models have significantly advanced computational antibody design, particularly in enhancing binding affinity to target antigens. However, beyond binding affinity, antibodies should exhibit other favorable biophysical properties such as non-antigen binding specificity and low self-association, which are important for antibody developability and clinical safety. To address this challenge, we propose AbNovo, a framework that leverages constrained preference optimization for multi-objective antibody design. First, we pre-train an antigen-conditioned generative model for antibody structure and sequence co-design. Then, we fine-tune the model using binding affinity as a reward while enforcing explicit constraints on other biophysical properties. Specifically, we model the physical binding energy with continuous rewards rather than pairwise preferences and explore a primal-and-dual approach for constrained optimization. Additionally, we incorporate a structure-aware protein language model to mitigate the issue of limited training data. Evaluated on independent test sets, AbNovo outperforms existing methods in metrics of binding affinity such as Rosetta binding energy and evolutionary plausibility, as well as in metrics for other biophysical properties like stability and specificity.
Correspondence should be addressed to H. Zhang (zhanghaicang@sjtu.edu.cn).

1 INTRODUCTION

Antibodies are vital immune proteins that bind to target antigens and trigger the adaptive immune response. Antibody design is critical in developing therapeutic drugs for a wide range of diseases such as cancer, autoimmune deficiency, and viral infections, with over a hundred antibody drugs currently approved (Kaplon et al., 2022). Structurally, antibodies consist of conserved framework regions and highly variable complementarity-determining regions (CDRs). The framework regions maintain the overall antibody structure, whereas CDRs exhibit significant variability in both structure and sequence and mainly determine antigen binding. Thus, the primary objective of computational antibody design is to design CDRs that bind the target antigens and possess favorable biochemical properties.

Recent deep generative models have achieved remarkable success in computational antibody design, particularly in enhancing antigen-specific binding affinity (Luo et al., 2022; Zhu et al., 2024; Martinkus et al., 2024; Lin et al., 2024; Zhou et al., 2024). For example, DiffAb (Luo et al., 2022) employs a denoising diffusion probabilistic model for antibody structure and sequence co-design. AbX (Zhu et al., 2024) utilizes a score-based diffusion model and incorporates geometric, physical, and evolutionary constraints to guide the design process. Notably, the recent method AbDPO (Zhou et al., 2024) integrates physical energy as guidance for binding affinity within the framework of direct preference optimization.

While binding affinity is crucial, antibodies should also exhibit other favorable biophysical properties such as high target specificity and low self-association, which are important for downstream developability and clinical safety. For example, non-specific (off-target) binding can potentially trigger unintended immune responses, causing inflammation or other syndromes (Nicholson et al., 1991; Ferrigno, 2016; Makowski et al., 2022). High self-association, mainly due to a large number of charged amino acids on CDRs, leads to antibody aggregation and results in decreased efficacy (Sule et al., 2011; Makowski et al., 2024). In wet-lab experiments, a common strategy is to generate a diverse set of candidates and then filter them based on desired properties (Watson et al., 2023; Bennett et al., 2023). However, this post-filtering approach is inefficient and has a lower success rate in designing antibodies that satisfy all specified constraints.

To address these challenges and bridge the gap between in silico design and real-world applications, we propose AbNovo, a framework that leverages constrained preference optimization for multi-objective antibody design. We first pre-train a score-based diffusion model for antibody structure and sequence co-design. We then employ constrained preference optimization to fine-tune the model using metrics of binding affinity, such as Rosetta binding energy and evolutionary plausibility, as objectives, while incorporating biophysical properties related to non-specific binding, self-association, and stability as constraints. During training, we utilize a primal-dual approach to iteratively optimize the policy model. In contrast to the previous framework of constrained direct preference optimization that leverages pairwise preferences, we model the physical binding energy as continuous rewards and collect both reward and constraint values offline. We introduce additional techniques to improve training stability and performance, such as incorporating a structure-aware protein language model to alleviate the scarcity of training data.
Our contributions are summarized as follows:

- We propose AbNovo, a deep generative model for multi-objective antibody design that incorporates explicit constraints representing biophysical properties critical for real-world antibody development.
- We extend the recent framework of constrained preference optimization from language model alignment to diffusion-based generative models, including a corresponding training algorithm supported by theoretical derivations and analysis.
- We introduce additional simple yet effective techniques to enhance performance. For instance, we incorporate a structure-aware protein language model trained on massive structural data beyond antibodies to alleviate overfitting caused by the scarcity of antibody-antigen training data.
- Experimental results show that AbNovo achieves state-of-the-art performance in metrics of binding affinity while also satisfying other biophysical properties.

2 RELATED WORKS

2.1 COMPUTATIONAL ANTIBODY DESIGN

Antibody design aims to optimize the antibody structure and sequence, particularly the CDRs, to bind target antigens while meeting other biophysical properties. Traditional methods rely on computationally intensive Monte Carlo-based simulations to optimize the antibody-antigen binding energy (Adolf-Bryfogle et al., 2018). Recent deep learning-based methods for antibody design can be broadly categorized into discriminative models and generative models. Discriminative models typically leverage graph neural networks to learn representations of the antigen structure and to predict the most likely antibody structure and sequence (Kong et al., 2023; Lin et al., 2024). In contrast, generative models such as denoising diffusion probabilistic models (DDPM) (Luo et al., 2022; Martinkus et al., 2024; Tan et al., 2024) and score-based diffusion models (Zhu et al., 2024; Kulytė et al., 2024) build an antigen-conditioned generative process over antibody sequences and structures.
Another trend is incorporating guidance into the generative process. AbX (Zhu et al., 2024) leverages evolutionary, physical, and geometric constraints to narrow the plausible structure and sequence sampling space. The method most related to our work is AbDPO (Zhou et al., 2024), which applies preference optimization to enhance binding affinity. Our method differs from AbDPO in terms of framework and training objective. First, AbDPO focuses on optimizing towards a lower Rosetta energy using the framework of direct preference optimization (DPO), whereas our method employs constrained preference optimization that optimizes binding affinity while imposing constraints related to specificity and self-association. Second, regarding the preference objective, AbDPO uses pairwise preferences, whereas our method employs continuous rewards to model the physical binding energy.

2.2 PREFERENCE OPTIMIZATION OF GENERATIVE MODELS

In natural language processing, large language models (LLMs) have achieved remarkable progress in natural language generation (Achiam et al., 2023; Touvron et al., 2023; Dubey et al., 2024). To better align these models with human values and preferences, various preference optimization frameworks (Azar et al., 2024; Chen et al., 2024b; Ethayarajh et al., 2024; Rafailov et al., 2024b; Meng et al., 2024; Zeng et al., 2024) have been developed. For example, Reinforcement Learning from Human Feedback (RLHF) pre-trains an explicit reward model and then fine-tunes the base model through reinforcement learning (Ouyang et al., 2022). Direct Preference Optimization (DPO) (Rafailov et al., 2024b) derives a closed form for the optimal policy and directly fine-tunes the base model with preference data rather than an explicit reward model. More recently, DPO has been adapted from LLM alignment to diffusion-based generative models for image generation (Wallace et al., 2024).
However, RLHF presents several challenges in practice. First, overoptimization arises (Gao et al., 2023; Coste et al., 2023; Eisenstein et al., 2023), as reward models often serve as an imperfect proxy for true human preferences. Second, a single reward with a scalar output is often inadequate to capture multiple aspects of human preferences, such as helpfulness and harmlessness, which are not always easily compatible (Ouyang et al., 2022; Ganguli et al., 2022; Thoppilan et al., 2022). To mitigate these issues, constrained RLHF (Moskovitz et al., 2023) has been proposed, which fine-tunes LLMs by maximizing a target reward while imposing explicit constraints on auxiliary safety objectives like harmlessness. More recently, the framework of constrained DPO (Liu et al., 2024; Huang et al., 2024) has also been proposed for LLM alignment. Our method is closely related to constrained DPO, but we extend it from LLM alignment to diffusion-based generative models and provide a rigorous theoretical derivation. Furthermore, by integrating noise contrastive estimation with constrained DPO, we model physical binding energy using continuous rewards instead of relying on pairwise preference data. Specifically, we evaluate both reward and constraint values for sampled antibodies using existing models that assess antibody biophysical properties.
[Figure 1 content: a two-stage pipeline. Stage 1 (training the base model): a structure-aware language model, folding trunk, and IPA are trained on AFDB (2M) + PDB (200k) with masked sequence recovery and structure prediction losses, and on SAbDab (7k) with structure/sequence DSM, violation, and auxiliary losses. Stage 2 (preference optimization): sampled antibodies are scored offline with rewards (e.g., binding affinity, evolutionary plausibility under a language model) and constraint function values (non-specific binding, self-association, stability), which guide the policy model against a reference model.]

Figure 1: The overview of AbNovo. In the first stage, we train the base model, a diffusion-based generative model for the co-design of antibody CDR sequences and structures. In the second stage, we update the policy model guided by biophysical properties, using a reference model initialized from the base model as a regularization term during this update.

3 METHOD

As illustrated in Figure 1, AbNovo comprises two main phases during training. First, we train an antigen-conditioned generative model for the co-design of antibody structure and sequence. We refer to this co-design model as the base model. Next, we fine-tune the base model using constrained preference optimization. In this process, we refer to the network being optimized as the policy model. The policy model is optimized based on biophysical properties such as Rosetta binding energy and self-association, while a reference model serves as a regularization term during this process, preventing over-optimization. The reference model is initialized from the base model. We briefly introduce the preliminaries and notations used in our methods in Section 3.1.
Subsequently, we describe the base model and the framework of constrained preference optimization in Sections 3.2 and 3.3, respectively.

3.1 PRELIMINARIES

Each antibody consists of two identical heavy (H) chains and two identical light (L) chains. The variable domain of each chain is divided into framework regions and three complementarity-determining regions (CDRs): H1, H2, and H3 in the heavy chain, and L1, L2, and L3 in the light chain. Following previous works (Luo et al., 2022; Campbell et al., 2024; Zhu et al., 2024), the structure is represented as elements of SE(3) to capture the local frames along the backbone (Jumper et al., 2021). For an antigen-antibody complex of total length $N$, each residue can be represented as $T_i = (x_i, r_i, a_i)$, where $i = 1, \ldots, N$. Here, $x_i \in \mathbb{R}^3$ is the coordinate of the C$\alpha$ atom of the $i$-th residue, $r_i \in \mathrm{SO}(3)$ is the rotation matrix of the local frame with respect to the global frame, and $a_i \in \{1, 2, \ldots, 20\} \cup \{[\mathrm{mask}]\}$ is one of the 20 residue types or the mask state $[\mathrm{mask}]$. We assume that the CDRs to be generated have $m$ residues in total, represented by $\mathcal{P}_{\mathrm{CDR}} = \{(x_i, r_i, a_i) \mid i = n+1, \ldots, n+m\}$. The antibody framework and antigen are represented by $\mathcal{P}_{\mathrm{FC}} = \{(x_i, r_i, a_i) \mid i \in \{1, \ldots, N\} \setminus \{n+1, \ldots, n+m\}\}$. Formally, our goal is to model the distribution of $\mathcal{P}_{\mathrm{CDR}}$ conditioned on $\mathcal{P}_{\mathrm{FC}}$. We provide all the notations and their corresponding descriptions in Table 4.

3.2 ANTIBODY STRUCTURE AND SEQUENCE CO-DESIGN

Diffusion Process for Sequence and Structure. Following previous works (Yim et al., 2023; Campbell et al., 2024; Zhu et al., 2024), we use a multi-modal diffusion model to jointly design antibody sequence and structure. Specifically, we use a Continuous Time Markov Chain (CTMC)-based diffusion model for the discrete sequence and a score-based SE(3) diffusion model for the structure, respectively.
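To make the notation concrete, here is a minimal sketch (our own illustration, not the paper's code; all names are ours) of the residue tuple $T_i = (x_i, r_i, a_i)$ and the split of a complex into the CDR set and the framework/antigen context:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Residue:
    """T_i = (x_i, r_i, a_i): one residue of the antigen-antibody complex."""
    x: Tuple[float, float, float]     # x_i in R^3, C-alpha coordinate
    r: Tuple[Tuple[float, ...], ...]  # r_i in SO(3), 3x3 local-frame rotation
    a: str                            # one of the 20 residue types or "[mask]"

def split_complex(residues: List[Residue], n: int, m: int):
    """CDR residues occupy indices n+1..n+m (1-based); the rest form P_FC."""
    p_cdr = residues[n:n + m]
    p_fc = residues[:n] + residues[n + m:]
    return p_cdr, p_fc
```

The generative model then targets the conditional distribution of `p_cdr` given `p_fc`.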
We use $T^{(t)} = \{T^{(t)}_i\}_{i=n+1}^{n+m} = \{(x^{(t)}_i, r^{(t)}_i, a^{(t)}_i)\}_{i=n+1}^{n+m}$ to represent the antibody's structure and sequence at time $t$, and $T^{(0:1)} = (T^{(0)}, T^{(\Delta t)}, \ldots, T^{(1)})$ to represent the diffusion path. Here, $t$ follows a uniform distribution $\mathcal{U}(0, 1)$, and we apply distinct noise schedules on $t$ for the translation $x^{(t)}_i$, rotation $r^{(t)}_i$, and sequence $a^{(t)}_i$, as described in Appendix A.4.2. The forward diffusion for $(x^{(t)}_i, r^{(t)}_i, a^{(t)}_i)$ can be written as follows:

$$x^{(t)}_i \sim \mathcal{N}\big(x^{(0)}_i e^{-t/2},\ (1 - e^{-t})\,\mathrm{Id}_3\big),$$
$$r^{(t)}_i \sim \mathrm{IGSO}(3)\big(r^{(0)}_i,\ t\big),$$
$$a^{(t)}_i \sim \mathrm{Cat}\big(\delta\{a^{(0)}_i, a^{(t)}_i\}(1 - t) + \delta\{a^{(1)}_i, a^{(t)}_i\}\, t\big).$$

Here, $\delta\{i, j\}$ is the Kronecker delta, which is 1 when $i$ equals $j$ and 0 otherwise. $\mathcal{N}$, $\mathrm{IGSO}(3)$, and $\mathrm{Cat}$ denote the Gaussian, isotropic Gaussian on SO(3), and categorical distributions, respectively. We use $s^x_\theta$, $s^r_\theta$, and $s^a_\theta$ to represent the score networks for translation, rotation, and sequence, respectively. We choose the prior distributions of structure and sequence as $p^{x,r}_{\mathrm{prior}}(x^{(1)}_i, r^{(1)}_i) = \big[P_\#\,\mathcal{N}(0, \mathrm{Id}_3)\big]^N \otimes \big[\mathrm{IGSO}(3)(0, \mathrm{Id})\big]^N$ and $p^{a}_{\mathrm{prior}}(a^{(1)}_i) = \{[\mathrm{mask}]\}^N$, respectively. Then, the reverse diffusion process can be written as follows:

$$x^{(t-\Delta t)}_i \sim P_\#\,\mathcal{N}\Big(x^{(t)}_i + \Delta t\big[\tfrac{1}{2} x^{(t)}_i + s^x_\theta(x^{(t)}_i, T^{(t)}, \mathcal{P}_{\mathrm{FC}}, t)\big],\ \Delta t\,\mathrm{Id}_3\Big),$$
$$r^{(t-\Delta t)}_i \sim \mathrm{IGSO}(3)\Big(\exp_{r^{(t)}_i}\big(\Delta t\, s^r_\theta(r^{(t)}_i, T^{(t)}, \mathcal{P}_{\mathrm{FC}}, t)\big),\ \Delta t\,\mathrm{Id}\Big),$$
$$a^{(t-\Delta t)}_i \sim \mathrm{Cat}\Big(\delta\{a^{(t)}_i, a^{(t-\Delta t)}_i\} + \Delta t\, s^a_\theta(a^{(t-\Delta t)}_i, T^{(t)}, \mathcal{P}_{\mathrm{FC}}, t)\Big),$$

where $\exp$ and $\log$ are the exponential and logarithmic maps on SO(3), and $P_\#$ is the projection that removes the center of mass.

Network Architecture. Inspired by previous works (Huguet et al., 2024; Zhu et al., 2024; Ren et al., 2024), we utilize a score network similar to the architecture proven useful in protein structure prediction (Lin et al., 2023; Jumper et al., 2021). It comprises three components: a structure-aware language model that we pre-trained on 2.2M structures, Evoformer trunks, and an Invariant Point Attention (IPA) module (Jumper et al., 2021).
We first extract embeddings from the structure-aware language model for both the conditional sequences of the antibody framework and antigen and the noisy CDR sequences. Structural information is encoded using a distogram of the antibody framework, antigen, and noisy CDRs, followed by a linear projection. These structure and sequence representations are then processed by the Evoformer trunks. Finally, the IPA module and an MLP output the denoised CDR structure and sequence, respectively. Appendix 12 and Figure 4 present more details on feature preprocessing and network architecture.

Training Losses. Following previous methods (Campbell et al., 2024; Zhu et al., 2024), the losses for training the base model primarily include Denoised Score Matching (DSM) losses for structure and sequence, auxiliary losses, and a structural violation loss. Appendix A.4.5 presents the details of these losses.

3.3 CONSTRAINED PREFERENCE OPTIMIZATION

To design antibodies that bind to target antigens while satisfying other biophysical properties, the training objective is formulated as:

$$\max_\theta\ \mathcal{J}^{(R)}(p_\theta) = \mathbb{E}_{\mathcal{P}_{\mathrm{FC}} \sim \mathcal{D}}\Big[ \mathbb{E}_{T^{(0)} \sim p^{x,r,a}_\theta(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\Big[ \textstyle\sum_m \omega_m R_m(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) \Big] - \beta\, D_{\mathrm{KL}}\big(p^{x,r,a}_\theta(T^{(0:1)} \mid \mathcal{P}_{\mathrm{FC}})\,\|\, p^{x,r,a}_{\mathrm{ref}}(T^{(0:1)} \mid \mathcal{P}_{\mathrm{FC}})\big) \Big], \quad (3)$$
$$\text{s.t.}\quad \mathbb{E}_{\mathcal{P}_{\mathrm{FC}} \sim \mathcal{D},\, T^{(0)} \sim p^{x,r,a}_\theta(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\big[ C_n(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) \big] < \bar{C}_n \quad \text{for all } n \text{ in the constraint set},$$

where $R_m$ and $\omega_m$ denote a normalized reward (e.g., Rosetta binding energy, evolutionary plausibility) and its weight, and $C_n$ and $\bar{C}_n$ denote a constraint (e.g., non-specific binding, self-association, stability) and its threshold. The first term of the objective maximizes rewards, while the second term keeps the policy model close to the reference model. Additionally, the optimization is constrained so that the expected constraint values do not exceed their thresholds. Both the reward and constraint values of sampled antibodies are computed offline, as detailed in Appendix A.5.
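As a toy illustration (our own sketch; function names and values are hypothetical), the weighted-reward term and the feasibility condition of the constrained objective above amount to:

```python
def weighted_reward(rewards, weights):
    # sum_m w_m * R_m(T^(0), P_FC) for one sampled antibody
    return sum(w * r for w, r in zip(weights, rewards))

def constraints_satisfied(expected_constraints, thresholds):
    # feasibility test: E[C_n] < Cbar_n for every constraint n
    return all(c < cbar for c, cbar in zip(expected_constraints, thresholds))
```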
This problem can be associated with a Lagrangian function, yielding the max-min optimization problem:

$$\max_{p_\theta} \min_{\lambda \in \mathbb{R}^N_+}\ \mathcal{J}^{(L)}(p_\theta, \lambda) = \mathcal{J}^{(R)}(p_\theta) - \mathcal{J}^{(C)}(p_\theta, \lambda), \quad (4)$$
$$\mathcal{J}^{(C)}(p_\theta, \lambda) = \sum_n \lambda_n \Big[ \mathbb{E}_{\mathcal{P}_{\mathrm{FC}} \sim \mathcal{D},\, T^{(0)} \sim p^{x,r,a}_\theta(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\big[ C_n(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) \big] - \bar{C}_n \Big], \quad (5)$$

where $\mathcal{J}^{(R)}$ is the objective function in Equation 3 and $\lambda = [\lambda_1, \ldots, \lambda_n]$ is the vector of dual variables. This formulation can be interpreted as appending a penalty $\mathcal{J}^{(C)}$ to the original objective $\mathcal{J}^{(R)}$. The penalty term, which depends on how much the antibodies generated by the current model violate the constraints, can be adjusted dynamically through the Lagrange multipliers $\lambda$. This problem can then be solved with the primal-dual method (Bertsekas, 2014; Ito & Kunisch, 2008; Liu et al., 2024) by iteratively taking two steps:

1. Update policy: update the network parameters $\theta$ to find $p^*_\lambda = \arg\max_{p_\theta} \mathcal{J}^{(L)}(p_\theta, \lambda)$ based on the current value of $\lambda$.
2. Update $\lambda$: update $\lambda$ by estimating the gradient of the dual function $G(\lambda) = \mathcal{J}^{(L)}(p^*_\lambda, \lambda)$ based on the policy $p^*_\lambda$.

Prior work has proved that the objective function is concave over $p_\theta$, and thus strong duality holds (Liu et al., 2024; Huang et al., 2024). A formal proof is provided in Appendix A.3.1. More specifically, we first sample a set of antibodies, denoted as $\mathcal{D}_g$, from the reference model $p_{\mathrm{ref}}$ and compute all reward and constraint values offline for these samples. Then, based on $\mathcal{D}_g$, we iteratively perform the two steps to optimize the policy model $p_\theta$, as detailed in Sections 3.3.1 and 3.3.2, respectively.

3.3.1 UPDATE POLICY

Given the reference model and the current value of $\lambda$, along with the corresponding preference data, previous methods (Rafailov et al., 2024b; Liu et al., 2024; Huang et al., 2024) leverage the DPO framework to determine $p^*_\lambda$. Here, we have significantly adapted this framework for the antibody design task.

Direct Preference Optimization with Continuous Rewards.
Recent work on NCA (Chen et al., 2024a) has extended the DPO framework to incorporate continuous reward values for large language model alignment. Since many biophysical properties (e.g., physical energy) are continuous values, we further adapt NCA to diffusion-based generative models and integrate it into the constrained preference optimization framework. The detailed theoretical derivation is presented in Appendix A.3.2. The optimal policy $p^*_\lambda$ has the following form:

$$p^*_\lambda \propto p_{\mathrm{ref}}\, \exp(\hat{R}/\beta), \quad (6)$$
$$\hat{R}(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) = \sum_m \omega_m R_m(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) - \sum_n \lambda_n C_n(T^{(0)}, \mathcal{P}_{\mathrm{FC}}), \quad (7)$$

where $T^{(0)} \in \mathcal{D}_g$ are the antibodies sampled from the reference model, and $\beta$ is the regularization weight in Equation 3. Integrating with NCA, the original training objective in Equation 4 can be reformulated as follows (Appendix A.3.2):

$$\mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}}(\theta) = \mathbb{E}_{\{T^{(0)}_i, \mathcal{P}_{\mathrm{FC}}, \hat{R}_i\}_{1:K} \sim \mathcal{D}_g}\left[ -\sum_{i=1}^{K}\left( \frac{\exp(\hat{R}_i/\beta)}{\sum_{j=1}^{K}\exp(\hat{R}_j/\beta)} \log \sigma(f_i) + \frac{1}{K}\log \sigma(-f_i) \right) \right], \quad (8)$$
$$f_i = \mathbb{E}_{T^{(\Delta t:1)} \sim p^{x,r,a}_\theta(T^{(\Delta t:1)} \mid T^{(0)}_i, \mathcal{P}_{\mathrm{FC}})}\left[ \log \frac{p^{x,r,a}_\theta(T^{(0:1)} \mid \mathcal{P}_{\mathrm{FC}})}{p^{x,r,a}_{\mathrm{ref}}(T^{(0:1)} \mid \mathcal{P}_{\mathrm{FC}})} \right].$$

Here, $\mathcal{D}_g$ denotes the antibodies sampled from the reference model, and $\sigma$ is the sigmoid function. In this equation, the first term increases the likelihood of samples in proportion to their rewards, while the second serves as a regularization term. Because the objective in Equation 8 is inefficient and intractable to train directly, we apply Jensen's inequality and approximate the reverse process with the forward diffusion $q^{x,r,a}$ (Wallace et al., 2024; Zhou et al., 2024). As derived in Appendix A.3.2, the simplified objective is:

$$\mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, \{T^{(0)}_i, \mathcal{P}_{\mathrm{FC}}, \hat{R}_i\}_{1:K} \sim \mathcal{D}_g,\, T^{(t)}_i \sim q^{x,r,a}(T^{(t)}_i \mid T^{(0)}_i)}\left[ -\sum_{i=1}^{K}\left( \frac{\exp(\hat{R}_i/\beta)}{\sum_{j=1}^{K}\exp(\hat{R}_j/\beta)} \log \sigma(F_i) + \frac{1}{K}\log \sigma(-F_i) \right) \right],$$
$$F_i = -D_{\mathrm{KL}}\big(q^{x,r,a}(T^{(t-\Delta t)}_i \mid T^{(0,t)}_i, \mathcal{P}_{\mathrm{FC}})\,\|\, p^{x,r,a}_\theta(T^{(t-\Delta t)}_i \mid T^{(0,t)}_i, \mathcal{P}_{\mathrm{FC}})\big) + D_{\mathrm{KL}}\big(q^{x,r,a}(T^{(t-\Delta t)}_i \mid T^{(0,t)}_i, \mathcal{P}_{\mathrm{FC}})\,\|\, p^{x,r,a}_{\mathrm{ref}}(T^{(t-\Delta t)}_i \mid T^{(0,t)}_i, \mathcal{P}_{\mathrm{FC}})\big). \quad (9)$$

Increase the Likelihood of Samples with High Rewards.
Previous studies have found that DPO-based approaches can decrease the likelihood of even the optimal preference samples (Rafailov et al., 2024a). Chen et al. (2024a) demonstrated that the NCA training objective ensures that the likelihood of the sample with the optimal reward does not decrease. Building on this, we keep the training objective from the base model, $\mathcal{L}_{\mathrm{sup}}$, to further increase the likelihood of samples with higher rewards. The total loss can then be written as:

$$\mathcal{L}_{\mathrm{update\ policy}} = \mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}} + \alpha^{(\mathrm{sup})} \sum_{i=1}^{K} \max\big(0,\ \hat{R}_i - 1\big)\, \mathcal{L}_{\mathrm{sup},i}. \quad (10)$$

3.3.2 UPDATE λ

This step calculates the gradient of the Lagrange multipliers by assessing the extent of constraint violation under the current policy. The gradient of $G(\lambda)$ can be written as follows:

$$\frac{dG(\lambda)}{d\lambda} = \mathbb{E}_{\mathcal{P}_{\mathrm{FC}} \sim \mathcal{D},\, T^{(0)} \sim p^*_\lambda(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\big[ \bar{C} - C(T^{(0)}, \mathcal{P}_{\mathrm{FC}}) \big], \quad (11)$$

where $C = [C_1, \ldots, C_n]$ and $\bar{C} = [\bar{C}_1, \ldots, \bar{C}_n]$. In this equation, the gradient of $G(\lambda)$ is given by the expected degree of constraint violation of the antibodies sampled under the current policy model $p^*_\lambda$. Considering that the optimal solution under $\lambda$ (Equation 6) can be written as $p^*_\lambda = \frac{1}{Z_\lambda} p_{\mathrm{ref}} \exp(\hat{R}/\beta)$, the estimator can be reduced to a closed form:

$$\frac{dG(\lambda)}{d\lambda} = \mathbb{E}_{\mathcal{P}_{\mathrm{FC}} \sim \mathcal{D}_g}\left[ \frac{\mathbb{E}_{T^{(0)} \sim p_{\mathrm{ref}}(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\big[ \exp(\hat{R}/\beta)\big(\bar{C} - C(T^{(0)}, \mathcal{P}_{\mathrm{FC}})\big) \big]}{\mathbb{E}_{T^{(0)} \sim p_{\mathrm{ref}}(T^{(0)} \mid \mathcal{P}_{\mathrm{FC}})}\big[ \exp(\hat{R}/\beta) \big]} \right]. \quad (12)$$

In this way, we can estimate the gradient of $G(\lambda)$ in an offline manner. The details of the estimation process are provided in Appendix A.3.3.

3.3.3 ITERATIVE OPTIMIZATION

Following previous work (Dubey et al., 2024), we update the reference model with the latest trained policy model and conduct several rounds of constrained preference optimization. Specifically, in the $k$-th round, we use the trained policy model from the $(k-1)$-th round to update the reference model and collect sampled antibodies together with their reward and constraint values offline. The entire training process for constrained preference optimization is detailed in Algorithm 1.
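Putting the pieces together, the offline primal-dual bookkeeping can be sketched in a few lines of pure Python (a toy scalar sketch of our own; the actual method optimizes a diffusion policy, and all function names are illustrative):

```python
import math

def combined_reward(rewards, constraints, weights, lam):
    # R_hat = sum_m w_m R_m - sum_n lambda_n C_n  (Equation 7-style combination)
    return (sum(w * r for w, r in zip(weights, rewards))
            - sum(l * c for l, c in zip(lam, constraints)))

def nca_weights(r_hat, beta):
    # softmax of R_hat_i / beta over the K offline samples (Equation 8 weighting)
    exps = [math.exp(r / beta) for r in r_hat]
    z = sum(exps)
    return [e / z for e in exps]

def dual_gradient(r_hat, constraints, thresholds, beta):
    # Self-normalized importance weighting (Equation 12-style estimator):
    # approximate E_{p*_lambda}[Cbar_n - C_n] using samples drawn from p_ref.
    w = nca_weights(r_hat, beta)
    return [sum(wi * (cbar - cs[n]) for wi, cs in zip(w, constraints))
            for n, cbar in enumerate(thresholds)]

def update_lambda(lam, grad, lr):
    # projected gradient step on G: a violated constraint (negative gradient
    # component) increases its multiplier; lambda stays non-negative
    return [max(0.0, l - lr * g) for l, g in zip(lam, grad)]
```

With equal rewards the importance weights are uniform, so the dual gradient reduces to the plain average violation, which is a convenient sanity check.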
4 EXPERIMENTS

4.1 EXPERIMENTS SETUP

Training and Testing Sets. We trained AbNovo using antibody-antigen complex structures derived from the SAbDab database (Dunbar et al., 2014) and evaluated its performance on the RAbD test set, which is widely used for in silico antibody design. During testing, we simultaneously generated all six CDRs conditioned on the antigen and the antibody framework regions. Following previous work, we strictly eliminated any overlap between the training and test sets by applying a 40% sequence identity threshold on CDR-H3. More details on the preparation of the training and testing datasets are provided in Appendix A.4.4.

Baseline Methods. We compare AbNovo with representative methods from each category: the discriminative models dyMEAN (Kong et al., 2023) and GeoAb (Lin et al., 2024), and the diffusion-based generative models DiffAb (Luo et al., 2022) and AbX (Zhu et al., 2024). Since dyMEAN does not use the native antibody framework structure as input, its setting differs from the other methods, which may lead to an underestimation of dyMEAN's performance in our comparisons. We note that AbDiffuser (Martinkus et al., 2024) and AbDPO (Zhou et al., 2024) are unavailable for benchmarking. More details on running these methods are in Appendix A.6.

Evaluation Metrics. We group the evaluation metrics into two categories: reference-based metrics and reference-independent metrics. The reference-based metrics assess the similarity between the designed and native antibody structures and sequences. Specifically, Amino Acid Recovery (AAR, %) measures sequence recovery accuracy by comparing the generated sequences to the native sequences. Root Mean Square Deviation (RMSD, Å) calculates the structural deviation between the Cα coordinates of the generated and native CDRs.
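For concreteness, the two reference-based metrics can be computed as follows (our own minimal sketch; it assumes the designed and native CDRs are already aligned and of equal length):

```python
import math

def aar(designed: str, native: str) -> float:
    # Amino Acid Recovery: fraction of positions where the designed
    # sequence matches the native sequence
    assert len(designed) == len(native)
    return sum(a == b for a, b in zip(designed, native)) / len(native)

def ca_rmsd(designed_ca, native_ca) -> float:
    # RMSD over C-alpha coordinates of the generated vs. native CDRs
    assert len(designed_ca) == len(native_ca)
    sq = sum((u - v) ** 2
             for p, q in zip(designed_ca, native_ca)
             for u, v in zip(p, q))
    return math.sqrt(sq / len(designed_ca))
```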
Algorithm 1: Constrained Preference Optimization for Antibody Design

1: Input: antigen-antibody complex dataset D, base model p_θ^(0), preference loss L, dual function G, reward functions {R_m(T^(0), P_FC)}_{m=1}^M, constraint functions {C_n(T^(0), P_FC)}_{n=1}^N, initial vector of dual variables λ, reward weights {ω_m}_{m=1}^M, learning rates (η_θ, η_λ), number of rounds of constrained preference optimization K, number of training steps per round B, number of training data V.
2: for k = 1 to K do
3:   p_ref^(k) ← p_θ^(k-1)    # initialize the reference model of the current round
4:   p_θ^(k) ← p_θ^(k-1)     # initialize the policy model of the current round
5:   D_g = {(T_i^(0), P_FC,i)}_{i=1}^V  s.t.  P_FC,i ∈ D, T_i^(0) ~ p_ref^(k)(T^(0) | P_FC,i)    # collect antibody samples from the reference model offline
6:   R_m,i, C_n,i ← Evaluate(D_g)    # compute the reward and constraint values of the samples in D_g offline
7:   for b = 1 to B do
8:     R̂_i(T_i^(0), P_FC) = Σ_m ω_m R_m,i − Σ_n λ_n C_n,i    # annotate D_g with Equation 7 under the current λ
9:     p_θ^(k) ← PolicyOptimizer(p_θ^(k), L, η_θ, D_g)    # update the policy model as in Equation 9
10:    λ ← LambdaOptimizer(λ, G, η_λ, D_g)    # update λ as in Equation 12
11:  end for
12: end for
13: return p_θ^(K)

The reference-independent metrics evaluate properties without direct comparison to native antibodies. Rosetta binding energy assesses the binding affinity of the designed antibodies to target antigens. Evolutionary plausibility is evaluated using the likelihood under an independent antibody language model (Shuai et al., 2021). Additionally, we consider the proportion of generated antibodies that satisfy constraints related to self-association, stability, and non-specific binding. For each method, we designed 128 antibodies per antigen and evaluated their average metrics. Details of these metrics and the thresholds for each constraint are provided in Appendix A.5.
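The "proportion of generated antibodies that satisfy constraints" can be computed with a sketch like the following (ours; thresholds and values are placeholders):

```python
def satisfaction_rate(per_sample_constraints, thresholds):
    # Fraction of the designs per antigen whose constraint values
    # all fall below their respective thresholds
    ok = sum(all(c < t for c, t in zip(cs, thresholds))
             for cs in per_sample_constraints)
    return ok / len(per_sample_constraints)
```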
4.1.1 EVALUATION ON MULTI-OBJECTIVE ANTIBODY DESIGN

As shown in Table 1, AbNovo outperforms all baseline methods across all reference-independent metrics. These results indicate that AbNovo not only excels in designing antibodies with superior binding energy and evolutionary plausibility but also achieves the lowest percentage of constraint violations among all methods. Furthermore, compared to our base model, AbNovo demonstrates significant improvements across all metrics, underscoring the effectiveness of constrained preference optimization. We also evaluated AbNovo on reference-based metrics. As shown in Table 5, our method achieves superior performance on these metrics compared to all baseline methods.

4.2 ABLATION STUDIES

We trained several ablation models to investigate the relative importance of the core components of our method. First, we compare constrained preference optimization with supervised fine-tuning (SFT). In the SFT setting, we selected the optimal sample for each antigen as training data, based on the weighted rewards for each property (SFT in Table 2). As presented in Table 2, while SFT improves the performance of the base model, it remains less effective than preference optimization.

1 GeoAb only designs the CDR-H3 of the antibody. For the other CDRs, we use the natural antibody sequence and structure for evaluation.

Table 1: Evaluation of reference-independent metrics on the RAbD test set. Here, "reference" denotes the native antibody structure and sequence in the RAbD dataset.
Methods         Binding Energy (↓)  Evolutionary Plausibility (↓)  Self-association (↓)  Stability (↓)  Non-specific Binding (↓)
reference       -19.41              2.38                           12.3%                 0%             3.5%
DiffAb          -0.96               2.60                           7.6%                  15.6%          2.3%
dyMEAN          -1.74               2.82                           50.8%                 94.5%          1.8%
GeoAb (1)       -1.75               2.69                           38.5%                 7.6%           4.3%
AbX             4.79                2.44                           14.5%                 4.8%           11.6%
AbNovo (base)   -2.60               2.41                           19.9%                 2.9%           10.9%
AbNovo          -12.05              2.36                           2.3%                  2.8%           1.7%

Second, we compare constrained preference optimization with the preference optimization used in AbDPO (Zhou et al., 2024). In this setting, we convert all constraints considered in AbNovo into optimization objectives (Multi-objective in Table 2). We observed a slight increase in the fraction of designs fulfilling all constraints but a significant drop in Binding Energy and Evolutionary Plausibility. This indicates that for antibody design, certain biophysical properties are more suitably treated as constraints rather than optimization objectives. The same phenomenon has been observed in previous studies on language model alignment (Liu et al., 2024; Huang et al., 2024).

Third, we include two ablation experiments to demonstrate the relative contribution of the structure-aware language model, which is used to alleviate the scarcity of antibody-antigen complex data. (i) We trained an ablation model that excludes the language model embeddings as input features (w.o. LM in Table 2). We observed significant drops across nearly all metrics, indicating the importance of the language model. (ii) We also trained an ablation model that replaces the structure-aware language model with a sequence-only language model (ESM-2 based in Table 2). The structure-aware model yields better results than the sequence-only language model. Additionally, we assessed its ability to predict long-distance contacts (Appendix A.2.4) and found that the structure-aware language model significantly outperforms the pure sequence language model on independent test sets, including the CASP and CAMEO datasets.
Finally, we analyzed the effect of varying the number of iteration rounds K in constrained preference optimization on performance. As shown in Table 3, increasing the number of iteration rounds yields improvements across overall metrics, with a higher proportion of generated samples satisfying the constraints.

Table 2: Ablation studies for AbNovo on the RAbD dataset. The ablation settings include: without the language model (w.o. LM), replacing the structure-aware language model with ESM-2 (ESM-2 based), using supervised fine-tuning instead of preference optimization (SFT), and incorporating all constraints into the optimization objectives (Multi-objective).

Methods          Binding Energy (↓)  Evolutionary Plausibility (↓)  All Constraints (↓)  AAR (↑)   RMSD (↓)
w.o. LM          7.54                2.67                           46.5%                41.53%    3.19
ESM-2 based      1.75                2.40                           30.8%                49.2%     2.55
AbNovo (base)    -2.60               2.41                           26.7%                49.9%     2.19
SFT              -6.46               2.37                           6.5%                 48.8%     2.41
Multi-objective  -4.05               2.39                           2.6%                 42.7%     2.43
AbNovo           -12.05              2.35                           3.9%                 48.5%     2.37

Table 3: Improvements achieved through iterative constrained preference optimization. Iter1-3 represent the model's performance under constrained preference optimization with varying numbers of iterations.

Methods          Binding Energy (↓)  Evolutionary Plausibility (↓)  Self-association (↓)  Stability (↓)  Non-specific Binding (↓)
AbNovo (base)    -2.60               2.41                           19.9%                 2.9%           10.9%
Iter1            -6.45               2.38                           6.5%                  8.9%           4.7%
Iter2            -11.60              2.38                           2.0%                  5.2%           5.6%
Iter3            -12.05              2.36                           2.3%                  2.8%           1.7%

4.3 CASE STUDIES

We present a case study (Figure 2) comparing the designed antibodies from different methods: dyMEAN, DiffAb, and AbNovo. This case illustrates that antibodies designed by AbNovo not only exhibit higher binding affinity to target antigens but also fully satisfy all constraints.
Previous studies have shown that a larger area of negatively charged patches in the CDRs corresponds to a higher risk of self-association in wet-lab experiments (Makowski et al., 2024). We see that dyMEAN produces a large number of charged amino acids, which can lead to potential risks of self-association. In detail, we present the distribution of metrics for antibodies designed by different methods for specific antigens. As shown in Figure 3, for AbNovo-designed antibodies, there is not only a higher percentage of antibodies that satisfy the constraints but also a greater proportion that outperforms natural antibodies in both binding energy and evolutionary plausibility.

[Figure 2 panels: dyMEAN (binding energy -3.01, ppl 2.85, stability/self-association/specificity [False, False, True]); DiffAb (binding energy 3.66, ppl 2.49, [True, True, True]); Ours (binding energy -16.75, ppl 2.21, [True, True, True]); AbX (binding energy -3.62, ppl 2.22, [True, True, False]). Designed CDR-H3 sequences: ARDLGYGSEDT, TRLKYYDPSGDY, TRLKRLYDVPDY, TRRGTFYGSFDY; native sequence: TRRNTLGDYFDY. Residues are colored as negatively charged, positively charged, hydrophobic, or hydrophilic.]

Figure 2: Visualization of the antibodies designed by dyMEAN, DiffAb, and AbNovo for a given antigen (PDB ID: 5NUZ). Orange and blue identify all CDRs of the designed antibody and all CDRs of the natural antibody, respectively. False and True indicate whether the designed antibody satisfies the different constraints. We present the CDR-H3 sequences designed by the different methods, indicating biophysical properties such as hydrophilicity and hydrophobicity.

5 DISCUSSION

To design antibodies with strong affinity while enhancing their developability and clinical safety, we propose AbNovo, a constrained preference optimization framework.
Experimental results demonstrate that the AbNovo framework improves the binding affinity and evolutionary plausibility of designed antibodies while ensuring specificity and stability and reducing self-association. The limitations of AbNovo mainly lie in the following aspects. First, while AbNovo emphasizes various in silico metrics for evaluation, a limitation is the absence of wet-lab experimental validation, which we will address in future work. Second, although AbNovo can design antibodies in scenarios with unknown antigen-antibody binding positions by integrating structure prediction and docking methods, the complexity of this pipeline may lead to error accumulation. Third, although metrics like Rosetta energy are widely used in evaluating antibody design, they still do not perfectly align with wet-lab experiments. Our framework is adaptable and can incorporate other physicochemical properties as rewards and constraints.

CODE AVAILABILITY

Code for AbNovo can be found at https://github.com/CarbonMatrixLab/AbNovo.

ACKNOWLEDGMENTS

We acknowledge the financial support provided by the National Natural Science Foundation of China (Grant No. 32370657) and the Plan for Advancing Scientific Research Paradigms and Empowering Disciplinary Upgrades Through Artificial Intelligence in Shanghai, awarded to H.Z.

REFERENCES

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pp. 1-3, 2024.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Jared Adolf-Bryfogle, Oleks Kalyuzhniy, Michael Kubitz, Brian D Weitzner, Xiaozhen Hu, Yumiko Adachi, William R Schief, and Roland L Dunbrack Jr. RosettaAntibodyDesign (RAbD): A general framework for computational antibody design. PLoS Computational Biology, 14(4):e1006112, 2018.

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447-4455. PMLR, 2024.

Nathaniel R Bennett, Brian Coventry, Inna Goreshnik, Buwei Huang, Aza Allen, Dionne Vafeados, Ying Po Peng, Justas Dauparas, Minkyung Baek, Lance Stewart, et al. Improving de novo protein binder design with deep learning. Nature Communications, 14(1):2625, 2023.

Dimitri P Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. arXiv preprint arXiv:2402.04997, 2024.

Huayu Chen, Guande He, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit rewards. arXiv preprint arXiv:2402.05369, 2024a.

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024b.

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.

Chai Discovery, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. bioRxiv, pp. 2024-10, 2024.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

James Dunbar, Konrad Krawczyk, Jinwoo Leem, Terry Baker, Angelika Fuchs, Guy Georges, Jiye Shi, and Charlotte M Deane. SAbDab: the structural antibody database. Nucleic Acids Research, 42(D1):D1140-D1146, 2014.

Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244, 2023.

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.

Paul Ko Ferrigno. Non-antibody protein-based biosensors. Essays in Biochemistry, 60(1):19-25, 2016.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835-10866. PMLR, 2023.

Brian L Hie, Varun R Shanker, Duo Xu, Theodora UJ Bruun, Payton A Weidenbacher, Shaogeng Tang, Wesley Wu, John E Pak, and Peter S Kim. Efficient evolution of human antibodies from general protein language models. Nature Biotechnology, 42(2):275-283, 2024.

Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, and Dongsheng Ding. One-shot safety alignment for large language models via optimal dualization. arXiv e-prints, arXiv:2405, 2024.
Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, et al. Sequence-augmented SE(3)-flow matching for conditional protein backbone generation. arXiv preprint arXiv:2405.20313, 2024.

Kazufumi Ito and Karl Kunisch. Lagrange Multiplier Approach to Variational Problems and Applications. SIAM, 2008.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583-589, 2021.

Sonoko Kanai, Jun Liu, Thomas W Patapoff, and Steven J Shire. Reversible self-association of a concentrated monoclonal antibody solution mediated by Fab-Fab interaction that impacts solution viscosity. Journal of Pharmaceutical Sciences, 97(10):4219-4227, 2008.

Hélène Kaplon, Alicia Chenoweth, Silvia Crescioli, and Janice M Reichert. Antibodies to watch in 2022. In mAbs, volume 14, pp. 2014296. Taylor & Francis, 2022.

Xiangzhe Kong, Wenbing Huang, and Yang Liu. End-to-end full-atom antibody design. In International Conference on Machine Learning, pp. 17409-17429. PMLR, 2023.

Paulina Kulytė, Francisco Vargas, Simon Valentin Mathis, Yu Guang Wang, José Miguel Hernández-Lobato, and Pietro Liò. Improving antibody design with force-guided sampling in diffusion models. arXiv preprint arXiv:2406.05832, 2024.

Jiahan Li, Chaoran Cheng, Zuofan Wu, Ruihan Guo, Shitong Luo, Zhizhou Ren, Jian Peng, and Jianzhu Ma. Full-atom peptide design based on multi-modal flow matching. arXiv preprint arXiv:2406.00735, 2024.

Haitao Lin, Lirong Wu, Yufei Huang, Yunfan Liu, Odin Zhang, Yuanqing Zhou, Rui Sun, and Stan Z Li. GeoAB: Towards realistic antibody design and reliable affinity maturation. In Forty-first International Conference on Machine Learning, 2024.
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123-1130, 2023.

Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing LLM safety via constrained direct preference optimization. arXiv preprint arXiv:2403.02475, 2024.

Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. Advances in Neural Information Processing Systems, 35:9754-9767, 2022.

Emily K Makowski, Patrick C Kinnunen, Jie Huang, Lina Wu, Matthew D Smith, Tiexin Wang, Alec A Desai, Craig N Streu, Yulei Zhang, Jennifer M Zupancic, et al. Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space. Nature Communications, 13(1):3788, 2022.

Emily K Makowski, Tiexin Wang, Jennifer M Zupancic, Jie Huang, Lina Wu, John S Schardt, Anne S De Groot, Stephanie L Elkins, William D Martin, and Peter M Tessier. Optimization of therapeutic antibodies for reduced self-association and non-specific binding via interpretable machine learning. Nature Biomedical Engineering, 8(1):45-56, 2024.

Karolis Martinkus, Jan Ludwiczak, Wei-Ching Liang, Julien Lafrance-Vanasse, Isidro Hotzel, Arvind Rajpal, Yan Wu, Kyunghyun Cho, Richard Bonneau, Vladimir Gligorijevic, et al. AbDiffuser: full-atom generation of in-vitro functioning antibodies. Advances in Neural Information Processing Systems, 36, 2024.

Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.

Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer.
Confronting reward model overoptimization with constrained RLHF. arXiv preprint arXiv:2310.04373, 2023.

Suellen Nicholson, David E Leslie, Theodora Efandis, Christopher K Fairley, and Ian D Gust. Hepatitis C antibody testing: problems associated with non-specific binding. Journal of Virological Methods, 33(3):311-317, 1991.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358, 2024a.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.

Milong Ren, Tian Zhu, and Haicang Zhang. CarbonNovo: Joint design of protein structure and sequence using a unified energy-based model. In Forty-first International Conference on Machine Learning, 2024.

Richard W Shuai, Jeffrey A Ruffolo, and Jeffrey J Gray. Generative language modeling for antibody design. bioRxiv, pp. 2021-12, 2021.

Shantanu V Sule, Muppalla Sukumar, William F Weiss, Anna Marie Marcelino-Cruz, Tyler Sample, and Peter M Tessier. High-throughput analysis of concentration-dependent antibody self-association. Biophysical Journal, 101(7):1749-1757, 2011.

Cheng Tan, Zhangyang Gao, Lirong Wu, Jun Xia, Jiangbin Zheng, Xihong Yang, Yue Liu, Bozhen Hu, and Stan Z Li. Cross-gate MLP with protein complex invariant embedding is a one-shot antibody designer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 15222-15230, 2024.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8228-8238, 2024.

Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089-1100, 2023.

Sandeep Yadav, Thomas M Laue, Devendra S Kalonia, Shubhadra N Singh, and Steven J Shire. The influence of charge distribution on self-association and viscosity behavior of monoclonal antibody solutions. Molecular Pharmaceutics, 9(4):791-802, 2012.

Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The HDOCK server for integrated protein-protein docking. Nature Protocols, 15(5):1829-1852, 2020.

Jason Yim, Brian L Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. SE(3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277, 2023.

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization. arXiv preprint arXiv:2404.11999, 2024.
Xiangxin Zhou, Dongyu Xue, Ruizhe Chen, Zaixiang Zheng, Liang Wang, and Quanquan Gu. Antigen-specific antibody design via direct energy-based preference optimization. arXiv preprint arXiv:2403.16576, 2024.

Tian Zhu, Milong Ren, and Haicang Zhang. Antibody design using a score-based diffusion model guided by evolutionary, physical and geometric constraints. In Forty-first International Conference on Machine Learning, 2024.

A.1 NOTATIONS

Table 4: Mathematical symbol explanations.

| Symbol | Description |
| P_FC | Antibody framework and antigen |
| P_CDR | CDRs of the antibody |
| x ∈ R³ | Coordinate of the Cα atom |
| r ∈ SO(3) | Rotation matrix |
| a ∈ {1, 2, ..., 20} ∪ {[mask]} | Amino acid type with mask |
| t | Diffusion time |
| T(t) = (x(t), r(t), a(t)) | Antibody structure and sequence at time t |
| T(0:1) = (T(0), T(Δt), ..., T(1)) | Diffusion path |
| s_θ | Score network |
| p_prior | Prior distribution |
| p_ref | Reference model |
| p_θ | Policy model |
| p*_λ | Optimal policy model under λ |
| U | Uniform distribution |
| N | Gaussian distribution |
| IGSO(3) | Isotropic Gaussian distribution on SO(3) |
| Cat | Categorical distribution |
| R | Reward function |
| C | Constraint function |
| ω ∈ R+ | Weight of reward |
| C̄ ∈ R+ | Threshold of constraint |
| λ ∈ R+ | Dual variable |
| δ | Kronecker delta function |
| D_KL | Kullback-Leibler divergence |
| P_# | Projection matrix that removes the center of mass |
| G | Dual function |
| D | Antigen-antibody complex dataset |
| D_g | Antibodies sampled from the reference model |
| β | Regularization weight |

A.2 ADDITIONAL RESULTS

A.2.1 REFERENCE-BASED METRICS EVALUATION

In this section, we evaluate AbNovo on reference-based metrics. As shown in Table 5, AbNovo outperforms all baseline methods.

A.2.2 ANTIBODY OPTIMIZATION

In this section, we further evaluate AbNovo on antibody optimization tasks. Here, we specifically compare AbNovo with the generative models DiffAb and AbX. We follow the experimental process proposed by DiffAb (Luo et al., 2022).
This process involves perturbing the CDR sequence and structure at time t using forward diffusion, then denoising from time t to time 0 in reverse diffusion to generate 128 antibodies for each antigen. We also follow the evaluation metrics for antibody optimization used in previous works (Luo et al., 2022; Zhu et al., 2024). We find that AbNovo outperforms the other baseline methods in optimizing Rosetta Binding Energy and Evolutionary Plausibility (Table 6), the proportion of satisfied constraints (Table 7), and AAR and RMSD (Table 8).

Table 5: Evaluation of reference-based metrics across each CDR in the RAbD test dataset (two column groups per block; each of the six blocks corresponds to one CDR).

| Method | AAR (%) | RMSD (Å) | Method | AAR (%) | RMSD (Å) |
| DiffAb | 70.01 | 0.88 | DiffAb | 61.07 | 0.85 |
| dyMEAN | 75.71 | 1.09 | dyMEAN | 75.55 | 1.03 |
| GeoAb | - | - | GeoAb | - | - |
| AbX | 80.92 | 0.85 | AbX | 80.37 | 0.80 |
| AbNovo (base) | 85.25 | 0.66 | AbNovo (base) | 84.34 | 0.66 |
| AbNovo | 84.55 | 0.65 | AbNovo | 83.50 | 0.65 |
| DiffAb | 38.52 | 0.78 | DiffAb | 58.58 | 0.55 |
| dyMEAN | 68.48 | 1.11 | dyMEAN | 83.09 | 0.66 |
| GeoAb | - | - | GeoAb | - | - |
| AbX | 70.73 | 0.76 | AbX | 84.53 | 0.45 |
| AbNovo (base) | 78.56 | 0.61 | AbNovo (base) | 88.25 | 0.32 |
| AbNovo | 76.60 | 0.62 | AbNovo | 88.05 | 0.35 |
| DiffAb | 28.05 | 2.86 | DiffAb | 47.57 | 1.39 |
| dyMEAN | 37.50 | 3.88 | dyMEAN | 52.11 | 1.44 |
| GeoAb | 41.19 | 2.57 | GeoAb | - | - |
| AbX | 44.18 | 2.50 | AbX | 65.89 | 1.21 |
| AbNovo (base) | 49.93 | 2.19 | AbNovo (base) | 73.88 | 0.86 |
| AbNovo | 48.55 | 2.38 | AbNovo | 74.45 | 0.86 |

Table 6: Performance of Rosetta Binding Energy (left) and Evolutionary Plausibility (right) across different antibody optimization steps for DiffAb, AbX, and AbNovo.

| Optimization Steps | DiffAb | AbX | AbNovo |
| 4 | -10.45 / 2.39 | -8.80 / 2.40 | -21.02 / 2.39 |
| 8 | -8.52 / 2.41 | -2.64 / 2.43 | -19.77 / 2.37 |
| 16 | -7.18 / 2.42 | 2.07 / 2.42 | -12.70 / 2.37 |
| 32 | 0.23 / 2.53 | -3.05 / 2.44 | -12.87 / 2.36 |
| 64 | 0.23 / 2.57 | 3.98 / 2.44 | -12.87 / 2.36 |
| 100 | -0.96 / 2.60 | 4.79 / 2.44 | -12.05 / 2.36 |

Table 7: Proportion of constraint violations across varying antibody optimization steps.
| Optimization Steps | DiffAb | AbX | AbNovo |
| 4 | 13.2% | 14.0% | 12.8% |
| 8 | 13.9% | 22.7% | 7.1% |
| 16 | 13.6% | 22.5% | 6.5% |
| 32 | 15.7% | 21.9% | 4.2% |
| 64 | 21.5% | 23.0% | 2.6% |
| 100 | 20.8% | 23.5% | 3.9% |

Table 8: AAR / RMSD across different antibody optimization steps.

| Optimization Steps | DiffAb | AbX | AbNovo |
| 4 | 0.88 / 1.09 | 0.80 / 0.97 | 0.85 / 0.80 |
| 8 | 0.76 / 1.59 | 0.59 / 1.51 | 0.66 / 1.34 |
| 16 | 0.48 / 1.78 | 0.49 / 1.54 | 0.51 / 1.46 |
| 32 | 0.39 / 2.05 | 0.45 / 1.88 | 0.50 / 1.66 |
| 64 | 0.30 / 2.69 | 0.45 / 2.33 | 0.48 / 2.03 |
| 100 | 0.28 / 2.86 | 0.44 / 2.50 | 0.49 / 2.38 |

A.2.3 DESIGNING ANTIBODIES FOR AN UNKNOWN BINDING POSE BETWEEN THE ANTIBODY AND ANTIGEN

To enable antibody design in scenarios with unknown antigen-antibody binding positions and unknown antibody frameworks, we adopted a previously established pipeline (Luo et al., 2022) (https://github.com/luost26/diffab/tree/main). Specifically, we used the structure prediction method Chai-1 (Discovery et al., 2024) and the docking software HDock (Yan et al., 2020) to predict the relative binding position of the antigen and antibody, followed by AbNovo for antibody design. As shown in Table 9, AbNovo outperforms the comparative methods.

Table 9: Evaluation of designed antibodies with an unknown binding pose between the antibody and antigen.

| Method | AAR | RMSD | Rosetta Binding Energy | Evolutionary Plausibility | Constraints |
| dyMEAN | 0.37 | 3.88 | -1.7 | 2.82 | 94.5% |
| AbX | 0.40 | 2.83 | 11.22 | 2.54 | 39.39% |
| AbNovo | 0.44 | 2.59 | -5.81 | 2.54 | 25.5% |

A.2.4 EVALUATING THE PRE-TRAINED LANGUAGE MODEL

To demonstrate the effectiveness of the structure-aware language model, we evaluated it on the CASP14, CASP15, and CAMEO (from 2022-05-01 to 2023-05-01) independent datasets. Following previous work (Lin et al., 2023), we choose long-range (sequence separation > 24) contact prediction precision (p@L) as the evaluation metric. The results are shown in Table 10. We observe that the structure-aware language model significantly outperforms the sequence-only language model (ESM-2 3B).
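The p@L metric used in this evaluation can be computed as in the following sketch, assuming a predicted contact-probability matrix and a binary true contact map as inputs (both hypothetical; `precision_at_L` is an illustrative name, not from the paper's code):

```python
import numpy as np

def precision_at_L(pred_probs, true_contacts, min_sep=24):
    """Long-range contact precision (p@L): among the L highest-scoring
    residue pairs with sequence separation |i - j| > min_sep (L = sequence
    length), the fraction that are true contacts.

    pred_probs: (L, L) matrix of predicted contact probabilities.
    true_contacts: (L, L) binary matrix of true contacts (e.g. pairs
    whose residues are closer than 8 angstroms).
    """
    L = pred_probs.shape[0]
    # upper-triangle pairs with separation strictly greater than min_sep
    iu, ju = np.triu_indices(L, k=min_sep + 1)
    # indices of the top-L scoring long-range pairs, highest first
    order = np.argsort(pred_probs[iu, ju])[::-1][:L]
    return true_contacts[iu[order], ju[order]].mean()
```

For a sequence of length L this scores only pairs more than 24 residues apart, matching the long-range definition used in Table 10.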
Table 10: Evaluation of the pre-trained language model on the CASP14, CASP15, and CAMEO datasets.

| Methods | CASP14 | CASP15 | CAMEO |
| ESM2-3B | 0.37 | 0.44 | 0.51 |
| Ours | 0.58 | 0.59 | 0.75 |

A.2.5 CASE STUDIES

We analyze the metric distributions of antibodies designed by various methods for specific antigens. Since dyMEAN is not a generative model and cannot produce diverse outcomes, its results are excluded from this analysis. As illustrated in Figure 3, AbNovo performs better in Rosetta Binding Energy, Evolutionary Plausibility, and the proportion of constraint satisfaction.

Figure 3: Distribution of Rosetta binding energy and evolutionary plausibility for 100 antibodies designed against different antigens (PDB: 4fqj, 1a2y, and 5nuz) using various methods (DiffAb, AbX, and AbNovo). The red star denotes the characteristics of the natural antibody. Antibodies that satisfy all constraints (Stability, Self-association, Non-specific Binding) are marked in blue, while those that violate the constraints are marked in red. The yellow regions highlight areas where the binding energy and evolutionary plausibility metrics exceed those of the natural antibody.

A.3 ANALYTICAL RESULTS

A.3.1 STRONG DUALITY OF CONSTRAINED PREFERENCE OPTIMIZATION

This section demonstrates that strong duality holds for the constrained optimization problem (Equation 3). This material is from previous works (Ito & Kunisch, 2008; Bertsekas, 2014; Huang et al., 2024); we include it here only for completeness. The Slater condition is an important concept in primal-dual algorithms and is mainly used to establish strong duality of a problem. Specifically, the Slater condition requires the existence of a strictly feasible point in the constrained optimization problem, i.e., a point at which all the inequality constraints hold strictly.
If Slater's condition is satisfied, the optimal value of the primal problem equals the optimal value of the dual problem, thus achieving strong duality.

Assumption 1 (Slater condition, strict feasibility) There exists a policy p and ϵ > 0 such that $J^{(C)}(p, \lambda) < \epsilon$.

In practice, this is achievable because we often know a strictly feasible solution of the optimization problem, so we can relax the constraints appropriately according to the application scenario (Liu et al., 2024). In this manner, AbNovo can satisfy the Slater condition.

Proposition 1 (Strong duality of constrained optimization) Under the Slater condition (Assumption 1), there is no duality gap for the constrained optimization problem. Let $p^*$ be the optimal primal policy, $p^* = \arg\max_{p_\theta} J^{(L)}(p_\theta, \lambda^*)$. Let $\lambda^*$ be the optimal dual variable, $\lambda^* = \arg\min_{\lambda \ge 0} G(\lambda)$. Then $(p^*, \lambda^*)$ is a saddle point of the Lagrangian function:

$$\max_{p_\theta} \min_{\lambda \in \mathbb{R}^N_+} J^{(L)}(p_\theta, \lambda) = J^{(L)}(p^*, \lambda^*) = \min_{\lambda \in \mathbb{R}^N_+} \max_{p_\theta} J^{(L)}(p_\theta, \lambda). \tag{13}$$

Specific proofs can be found in previous works (Liu et al., 2024; Huang et al., 2024).

A.3.2 CONTINUOUS REWARD PREFERENCE OPTIMIZATION FOR SCORE-BASED DIFFUSION MODELS

In this section, we derive the training objective for the score-based diffusion model with a continuous reward, starting from the objective form of RLHF. Following the approach of previous work (Wallace et al., 2024), we first define a reparameterization of the reward. Subsequently, based on this reward, we obtain the reparameterization of $f_\theta$ as described in Chen et al. (2024a). We simplify notation in this section by using $q$, $p_\theta$, and $p_{ref}$ to represent $q^{x,r,a}$, $p^{x,r,a}_\theta$, and $p^{x,r,a}_{ref}$, respectively.
We define $r_m(T^{(0:1)}, P_{FC})$ and $c_n(T^{(0:1)}, P_{FC})$ as the reward function and the constraint function on the whole trajectory of the diffusion process, and define $R_m(T^{(0)}, P_{FC})$ and $C_n(T^{(0)}, P_{FC})$ as follows:

$$R_m(T^{(0)}, P_{FC}) = \mathbb{E}_{T^{(\Delta t:1)} \sim p_\theta(T^{(\Delta t:1)} \mid T^{(0)}, P_{FC})}\big[r_m(T^{(0:1)}, P_{FC})\big],$$
$$C_n(T^{(0)}, P_{FC}) = \mathbb{E}_{T^{(\Delta t:1)} \sim p_\theta(T^{(\Delta t:1)} \mid T^{(0)}, P_{FC})}\big[c_n(T^{(0:1)}, P_{FC})\big], \tag{14}$$

where $T^{(0:1)} = (T^{(0)}, T^{(\Delta t)}, \ldots, T^{(1)})$ denotes the inference trajectory of the diffusion process. Then, considering a single $P_{FC}$ for simplicity, the objective of AbNovo can be written as follows:

$$\max_\theta \; \mathbb{E}_{T^{(0)} \sim p_\theta(T^{(0)} \mid P_{FC})}\Big[\sum_m \omega_m R_m(T^{(0)}, P_{FC}) - \sum_n \lambda_n C_n(T^{(0)}, P_{FC})\Big] - \beta\, D_{KL}\big(p_\theta(T^{(0)} \mid P_{FC}) \,\|\, p_{ref}(T^{(0)} \mid P_{FC})\big). \tag{15}$$

From Equation 7, we can obtain $\hat R(T^{(0)}, P_{FC})$ as

$$\hat R(T^{(0)}, P_{FC}) = \mathbb{E}_{T^{(\Delta t:1)} \sim p_\theta(T^{(\Delta t:1)} \mid T^{(0)}, P_{FC})}\big[\hat r(T^{(0:1)}, P_{FC})\big], \tag{16}$$

where $\hat r(T^{(0:1)}, P_{FC})$ is defined as

$$\hat r(T^{(0:1)}, P_{FC}) = \sum_m \omega_m r_m(T^{(0:1)}, P_{FC}) - \sum_n \lambda_n c_n(T^{(0:1)}, P_{FC}). \tag{17}$$

Then, the objective in Equation 15 can be transformed into the following form:

$$\min_\theta \; \mathbb{E}_{T^{(0)} \sim p_\theta(T^{(0)} \mid P_{FC})}\Big[-\big(\hat R(T^{(0)}, P_{FC}) + \sum_n \lambda_n \bar C_n\big)/\beta\Big] + D_{KL}\big(p_\theta(T^{(0)} \mid P_{FC}) \,\|\, p_{ref}(T^{(0)} \mid P_{FC})\big)$$
$$\le \min_\theta \; \mathbb{E}_{T^{(0)} \sim p_\theta(T^{(0)} \mid P_{FC})}\Big[-\big(\hat R(T^{(0)}, P_{FC}) + \sum_n \lambda_n \bar C_n\big)/\beta\Big] + D_{KL}\big(p_\theta(T^{(0:1)} \mid P_{FC}) \,\|\, p_{ref}(T^{(0:1)} \mid P_{FC})\big)$$
$$= \min_\theta \; \mathbb{E}_{T^{(0:1)} \sim p_\theta(T^{(0:1)} \mid P_{FC})}\big[-\hat r(T^{(0:1)}, P_{FC})/\beta\big] + D_{KL}\big(p_\theta(T^{(0:1)} \mid P_{FC}) \,\|\, p_{ref}(T^{(0:1)} \mid P_{FC})\big)$$
$$= \min_\theta \; \mathbb{E}_{T^{(0:1)} \sim p_\theta(T^{(0:1)} \mid P_{FC})}\left[\log \frac{p_\theta(T^{(0:1)} \mid P_{FC})}{p_{ref}(T^{(0:1)} \mid P_{FC}) \exp\big(\hat r(T^{(0:1)}, P_{FC})/\beta\big)/Z(P_{FC})}\right], \tag{18}$$

where $Z(P_{FC}) = \mathbb{E}_{T^{(0:1)} \sim p_{ref}(T^{(0:1)} \mid P_{FC})}\big[\exp(\hat r(T^{(0:1)}, P_{FC})/\beta)\big]$ and $\bar C_n$ is independent of $\theta$. The optimal $p^*_\theta(T^{(0:1)} \mid P_{FC})$ of Equation 18 has a unique closed-form solution:

$$p^*_\theta(T^{(0:1)} \mid P_{FC}) = p_{ref}(T^{(0:1)} \mid P_{FC}) \exp\big(\hat r(T^{(0:1)}, P_{FC})/\beta\big)/Z(P_{FC}). \tag{19}$$

Therefore, we have the reparameterization of $\hat r(T^{(0:1)}, P_{FC})$ as follows:

$$\hat r(T^{(0:1)}, P_{FC}) = \beta \log \frac{p^*_\theta(T^{(0:1)} \mid P_{FC})}{p_{ref}(T^{(0:1)} \mid P_{FC})} + \beta \log Z(P_{FC}). \tag{20}$$

Plugging this into the definition of $\hat R(T^{(0)}, P_{FC})$ in Equation 16, we have:

$$\hat R(T^{(0)}, P_{FC}) = \beta\, \mathbb{E}_{T^{(\Delta t:1)} \sim p_\theta(T^{(\Delta t:1)} \mid T^{(0)}, P_{FC})}\left[\log \frac{p^*_\theta(T^{(0:1)} \mid P_{FC})}{p_{ref}(T^{(0:1)} \mid P_{FC})}\right] + \beta \log Z(P_{FC}). \tag{21}$$
Then, by Equation (15) in Chen et al. (2024a), $f_\theta$ can be defined as follows:

$$f_\theta(T^{(0)}, P_{FC}) = \mathbb{E}_{T^{(\Delta t:1)} \sim p_\theta(T^{(\Delta t:1)} \mid T^{(0)}, P_{FC})}\left[\log \frac{p_\theta(T^{(0:1)} \mid P_{FC})}{p_{ref}(T^{(0:1)} \mid P_{FC})}\right]. \tag{22}$$

Here, considering that the computation of $Z(P_{FC})$ is intractable, we approximate it by $\mathbb{E}_{T^{(0)} \sim p_{ref}(T^{(0)} \mid P_{FC})}\big[\exp(\hat R(T^{(0)}, P_{FC})/\beta)\big]$ to simplify the computation. Substituting this reparameterization of $f_\theta$ into Equation (16) in Chen et al. (2024a), we obtain the objective of diffusion-based NCA:

$$\mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}}(\theta) = -\sum_{i=1}^{K}\left[\frac{\exp(\hat R_i/\beta)}{\sum_{j=1}^{K}\exp(\hat R_j/\beta)} \log \sigma\big(f_\theta(T^{(0)}_i, P_{FC})\big) + \frac{1}{K}\log \sigma\big(-f_\theta(T^{(0)}_i, P_{FC})\big)\right]$$
$$= -\sum_{i=1}^{K}\left[\frac{\exp(\hat R_i/\beta)}{\sum_{j=1}^{K}\exp(\hat R_j/\beta)} \log \sigma\left(\mathbb{E}_{T^{(\Delta t:1)}_i \sim p_\theta(T^{(\Delta t:1)}_i \mid T^{(0)}_i, P_{FC})}\log \frac{p_\theta(T^{(0:1)}_i \mid P_{FC})}{p_{ref}(T^{(0:1)}_i \mid P_{FC})}\right) + \frac{1}{K}\log \sigma\left(-\mathbb{E}_{T^{(\Delta t:1)}_i \sim p_\theta(T^{(\Delta t:1)}_i \mid T^{(0)}_i, P_{FC})}\log \frac{p_\theta(T^{(0:1)}_i \mid P_{FC})}{p_{ref}(T^{(0:1)}_i \mid P_{FC})}\right)\right].$$

Since sampling from $p_\theta(T^{(\Delta t:1)}_i \mid T^{(0)}_i, P_{FC})$ is intractable, we utilize $q(T^{(\Delta t:1)}_i \mid T^{(0)}_i, P_{FC})$ as an approximation (Wallace et al., 2024). Decomposing the trajectory log-ratio into per-step transition log-ratios, $\log \frac{p_\theta(T^{(0:1)}_i \mid P_{FC})}{p_{ref}(T^{(0:1)}_i \mid P_{FC})} = \sum_{t \in \{\Delta t, \ldots, 1\}} \log \frac{p_\theta(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}{p_{ref}(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}$, and writing the sum over $t$ as $\frac{1}{\Delta t}\mathbb{E}_t[\cdot]$, we obtain:

$$\mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}}(\theta) = -\sum_{i=1}^{K}\left[\frac{\exp(\hat R_i/\beta)}{\sum_{j=1}^{K}\exp(\hat R_j/\beta)} \log \sigma\left(\frac{1}{\Delta t}\,\mathbb{E}_t\, \mathbb{E}_{T^{(t)}_i \sim q(T^{(t)}_i \mid T^{(0)}_i, P_{FC})}\, \mathbb{E}_{T^{(t-\Delta t)}_i \sim q(T^{(t-\Delta t)}_i \mid T^{(t)}_i, T^{(0)}_i, P_{FC})} \log \frac{p_\theta(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}{p_{ref}(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}\right) + \frac{1}{K}\log \sigma\left(-\frac{1}{\Delta t}\,\mathbb{E}_t\, \mathbb{E}_{T^{(t)}_i \sim q(T^{(t)}_i \mid T^{(0)}_i, P_{FC})}\, \mathbb{E}_{T^{(t-\Delta t)}_i \sim q(T^{(t-\Delta t)}_i \mid T^{(t)}_i, T^{(0)}_i, P_{FC})} \log \frac{p_\theta(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}{p_{ref}(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}\right)\right].$$
By using Jensen's inequality ($\log \sigma$ is concave, so the expectations over $t$ and $T^{(t)}_i$ can be moved outside), we have:

$$\mathcal{L}^{\mathrm{diff}}_{\mathrm{NCA}}(\theta) \le -\,\mathbb{E}_t\, \mathbb{E}_{\{T^{(t)}_i \sim q(T^{(t)}_i \mid T^{(0)}_i, P_{FC})\}_{1:K}} \sum_{i=1}^{K}\left[\frac{\exp(\hat R_i/\beta)}{\sum_{j=1}^{K}\exp(\hat R_j/\beta)} \log \sigma\left(\frac{1}{\Delta t}\, \mathbb{E}_{T^{(t-\Delta t)}_i \sim q(T^{(t-\Delta t)}_i \mid T^{(t)}_i, T^{(0)}_i, P_{FC})} \log \frac{p_\theta(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}{p_{ref}(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}\right) + \frac{1}{K}\log \sigma\left(-\frac{1}{\Delta t}\, \mathbb{E}_{T^{(t-\Delta t)}_i \sim q(T^{(t-\Delta t)}_i \mid T^{(t)}_i, T^{(0)}_i, P_{FC})} \log \frac{p_\theta(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}{p_{ref}(T^{(t-\Delta t)}_i \mid T^{(t)}_i, P_{FC})}\right)\right].$$

With some algebra, the above loss simplifies to Equation 9.

A.3.3 OFFLINE DUAL GRADIENT ESTIMATES

In this section, we introduce how to use offline data to estimate the gradient of the dual function. From Equation 11, the gradient of the dual function $G(\lambda)$ can be calculated as follows:

$$\frac{dG(\lambda)}{d\lambda} = \frac{\partial J^{(L)}(p^*_\lambda, \lambda)}{\partial p^*_\lambda}\frac{dp^*_\lambda}{d\lambda} + \frac{\partial J^{(L)}(p^*_\lambda, \lambda)}{\partial \lambda} = \frac{\partial J^{(L)}(p^*_\lambda, \lambda)}{\partial \lambda} = \frac{\partial\big[J^{(R)}(p^*_\lambda) - J^{(C)}(p^*_\lambda, \lambda)\big]}{\partial \lambda} = -\frac{\partial J^{(C)}(p^*_\lambda, \lambda)}{\partial \lambda} = \mathbb{E}_{P_{FC} \sim \mathcal{D},\, T^{(0)} \sim p^*_\lambda(T^{(0)} \mid P_{FC})}\big[\bar C - C(T^{(0)}, P_{FC})\big], \tag{26}$$

where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]$; the first term vanishes because $p^*_\lambda$ is optimal for the current $\lambda$. After updating the policy, the optimal solution of Equation 9 under $\lambda$ is

$$p^*_\lambda \propto p_{ref} \exp(\hat R/\beta). \tag{27}$$

Thus, Equation 26 can be written as

$$\frac{dG(\lambda)}{d\lambda} = \mathbb{E}_{P_{FC} \sim \mathcal{D}_g}\left[\frac{\mathbb{E}_{T^{(0)} \sim p_{ref}(T^{(0)} \mid P_{FC})}\big[\exp(\hat R/\beta)\,\big(\bar C - C(T^{(0)}, P_{FC})\big)\big]}{\mathbb{E}_{T^{(0)} \sim p_{ref}(T^{(0)} \mid P_{FC})}\big[\exp(\hat R/\beta)\big]}\right]. \tag{28}$$

The gradient of $G(\lambda)$ can therefore be estimated from offline data, since Equation 28 depends only on $p_{ref}$. We thus use a large offline dataset $\mathcal{D}_g = \{\{R_m(T^{(0)}_{j,k}, P_{FC,k}), C_n(T^{(0)}_{j,k}, P_{FC,k})\}_{j=1}^{J}\}_{k=1}^{K}$ (for $K$ antigens, $J$ antibodies sampled per antigen) to estimate the gradient.
The specific form can be written as follows:

$$\left\{\frac{dG(\lambda)}{d\lambda}\right\} \approx -\sum_{j=1}^{J} \mathrm{softmax}\big(\{\hat R_j/\beta\}_{j=1}^{J}\big)_j\, C(T^{(0)}_{j,k}, P_{FC,k}) + C_{avg} + \bar C, \tag{29}$$

where $\hat R$ is defined as in Equation 7, $C = [C_1, \ldots, C_n]$ are the constraint functions, $\bar C = [\bar C_1, \ldots, \bar C_n]$ are the thresholds for the different constraints, and $C_{avg}$ is a global normalization term. We use $\{\cdot\}_i$ to denote the $i$-th element of a vector. If the current $\lambda_j$ is high while the model already satisfies the $j$-th constraint, the large $\lambda_j$ gives greater weight to samples with low $C_j$, so $\{dG(\lambda)/d\lambda\}_j$ is larger and the dual update decreases $\lambda_j$. Conversely, if the current $\lambda_j$ is low while the model violates the $j$-th constraint, the small $\lambda_j$ gives substantial weight to samples with high $C_j$, so $\{dG(\lambda)/d\lambda\}_j$ is smaller and, after normalization, negative; the next step then increases $\lambda_j$.

A.4 ABNOVO BASE MODEL

This section introduces AbNovo's training strategies, training losses, and inference process.

A.4.1 PRE-TRAINED LANGUAGE MODEL

We masked 10% of the positions in each protein sequence following the BERT-style scheme (Lin et al., 2023). Specifically, 85% of the masked positions were replaced with the [mask] token, 10% were substituted with random amino acids, and 5% remained unchanged. Additionally, for protein sequences longer than 200 residues, we randomly masked consecutive segments of between 5 and 13 amino acids. Concurrently, we predicted the amino-acid distance matrix with a distogram loss L_distogram for input sequences containing masked regions. We also employed a contact prediction loss L_contact, treating residue pairs with sequence separation greater than 24 and distance below 8 Å as long-range contacts. The ESM2-3B model (Lin et al., 2023) was used as our model architecture, with its pre-trained weights serving as initialization.
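The masking scheme described above can be sketched as follows; the function name, the use of Python's `random` module, and the return format are our own illustrative choices, not the paper's implementation (which also adds span masking for sequences longer than 200 residues).

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "[mask]"

def mask_sequence(seq, mask_frac=0.10, rng=None):
    """BERT-style corruption for language-model pre-training (sketch).

    Selects ~10% of positions as prediction targets; of those, 85%
    become [mask], 10% a random amino acid, and 5% stay unchanged.
    Returns the corrupted sequence (as a list of tokens) and the
    sorted list of selected positions.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    n_mask = max(1, int(len(tokens) * mask_frac))
    targets = rng.sample(range(len(tokens)), n_mask)
    for i in targets:
        r = rng.random()
        if r < 0.85:
            tokens[i] = MASK
        elif r < 0.95:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # else: keep the original residue unchanged
    return tokens, sorted(targets)
```

The model is then trained to recover the original residues at the returned target positions.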
A.4.2 DIFFUSION PROCESS DETAILS

Following the setup of previous work (Campbell et al., 2024), we tailor the noise schedule independently for each of the three diffusion processes (Table 11). For the diffusion computations and parameter settings, we follow previous works (Yim et al., 2023; Zhu et al., 2024).

Table 11: Noise schedule settings.

Process      Schedule                                             Parameters
Translation  β(t) = βmin + t(βmax − βmin)                         βmin = 0.1; βmax = 20
Rotation     σ(t) = log(t·exp(σmax) + (1 − t)·exp(σmin))          σmin = 0.01; σmax = 2.25
Sequence     α(t) = 1 − 3(1 − t)                                  -

A.4.3 ABNOVO MODEL ARCHITECTURE

Networks. AbNovo adopts a network architecture similar to ESMFold (Lin et al., 2023), comprising our pre-trained structure-aware language model, a 4-layer main trunk, an IPA module, a structure encoder, and a sequence decoder. The sequence encoder is a structure-aware language model with 33 Transformer layers, while the structure encoder is a single linear layer. The sequence decoder, a three-layer MLP, maps the single representation from the main trunk to amino acid types. The network dimensions are provided in Table 13. The model architecture of AbNovo is depicted in Figure 4. Consistent with prior studies (Jumper et al., 2021; Lin et al., 2023; Zhu et al., 2024; Abramson et al., 2024), we find that the recycling technique further enhances performance. The recycled features are the predicted antibody sequence and the antibody distance matrix.
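For illustration (a sketch under the parameter values in Table 11, not the released implementation), the translation and rotation noise schedules can be evaluated as:

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0      # translation schedule endpoints (Table 11)
SIGMA_MIN, SIGMA_MAX = 0.01, 2.25   # rotation schedule endpoints (Table 11)

def beta_translation(t):
    """Linear beta schedule for the R^3 translation diffusion."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def sigma_rotation(t):
    """Log-interpolated sigma schedule for the SO(3) rotation diffusion."""
    return math.log(t * math.exp(SIGMA_MAX) + (1.0 - t) * math.exp(SIGMA_MIN))
```

Both schedules recover their endpoint values at t = 0 and t = 1, and the log-interpolation keeps the rotation noise growing smoothly between σmin and σmax.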
Figure 4: Encoder and score network of AbNovo. [Figure: the structure-aware language model (sequence encoder) and a linear structure encoder feed the Evoformer trunk, followed by structure and sequence decoders; inputs are the antigen sequence and structure, the antibody framework sequence and structure, and the noisy CDR sequence and structure, with three recycling iterations producing the denoised CDR sequence and structure.]

Input Features. The details of the input features are shown in Table 12.

Table 12: Input features of AbNovo.

Module             Description of input features
Sequence encoder   1. Sequences of the antibody framework, antigen, and noisy CDR
Structure encoder  1. Distance matrices of the antibody framework, antigen, and noisy CDR
                   2. Backbone dihedral angles (ϕ, ψ, and ω) in sine and cosine form for the antibody framework, antigen, and noisy CDR
Others             1. Chain IDs of antibody and antigen
                   2. Time embeddings for sequence and structure
                   3. Relative position embedding

Table 13: Network hyper-parameters of AbNovo.

Description                                                Value
Dimension of single representation in Evoformer trunk     512
Dimension of pair representation in Evoformer trunk       128
Dimension of single representation in IPA                 384
Dimension of sequence decoder                             512

A.4.4 DATASETS

Datasets for the structure-aware language model. We trained on all proteins deposited in the Protein Data Bank (PDB) before 2020-01-01, after filtering some antibody data: we removed antibody structures whose CDR-H3 region shares more than 40% sequence similarity with the test set. We followed the data processing of AlphaFold2 (Jumper et al., 2021) and used MMseqs for clustering at 40% sequence similarity.
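The sine/cosine dihedral featurization in Table 12 is standard practice; a minimal sketch (illustrative only, not the authors' code) of encoding one residue's backbone dihedrals:

```python
import math

def dihedral_features(phi, psi, omega):
    """Encode backbone dihedrals (in radians) as [sin, cos] pairs, giving a
    6-dimensional feature per residue that is continuous across the +/- pi
    wrap-around, unlike the raw angles."""
    feats = []
    for angle in (phi, psi, omega):
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats
```

The sin/cos encoding is preferred because angles near −π and +π describe the same geometry but differ numerically; their sine/cosine values coincide.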
For proteins in the AlphaFold Database (AFDB), we masked all amino acids with a pLDDT below 50 and removed proteins with an overall pLDDT below 70. During training, we used a crop size of 512 and sampled the PDB and AFDB training data at a ratio of 1:4. We used the parameters of the ESM2-3B model (Lin et al., 2023) as initialization.

Datasets for training the base model. We follow the training-set list provided by previous work (Luo et al., 2022) at https://github.com/luost26/diffab/blob/main/data/sabdab_summary_all.tsv.

Datasets for preference optimization. Following previous works (Zhou et al., 2024), we construct preference-optimization datasets of antigen-antibody complexes. In each preference-optimization round, we generate 512 antibodies for each antigen and compute the metrics for these antibodies. When updating λ, we randomly select 10,000 antigen-antibody complexes (200 antibodies for each of 50 antigens) to estimate the gradient.

A.4.5 TRAINING LOSSES

Training losses for the structure-aware language model. We use the masked language model loss (Lin et al., 2023), the distogram loss (Jumper et al., 2021), and a contact prediction loss to train our base model. The specific loss functions are:

$$
\mathcal{L}_{\text{pretrain}} = 0.5\,\mathcal{L}_{\text{MLM}} + 1.0\,\mathcal{L}_{\text{distogram}} + 1.0\,\mathcal{L}_{\text{contact}},
$$
$$
\mathcal{L}_{\text{distogram}} = -\frac{1}{N^2} \sum_{i,j} \sum_{b=1}^{32} y^b_{ij} \log p^b_{ij},
$$
$$
\mathcal{L}_{\text{contact}} = -\frac{1}{N^2} \sum_{i,j} \sum_{b \in \Omega} \mathbb{1}(d_{ij} < 8\,\text{Å}) \log p^b_{ij}.
$$

The distogram loss L_distogram supervises the prediction of distances between amino acid pairs: we divide the range from 2 Å to 22 Å into 32 bins and predict which bin each residue-pair distance falls into, where p^b_ij is the predicted probability of bin b. The contact loss L_contact supervises the prediction of whether an amino acid pair is in contact (i.e., whether its distance is less than 8 Å), where Ω = {contact, not contact}.

Training losses for the base model.
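A sketch of the 2-22 Å, 32-bin distogram target construction and the 8 Å contact label described above (illustrative assumptions: uniform bin widths, out-of-range distances clipped into the boundary bins):

```python
def distogram_bin(distance, d_min=2.0, d_max=22.0, n_bins=32):
    """Map a pairwise residue distance (in Angstroms) to one of n_bins
    uniformly spaced bins spanning [d_min, d_max]; out-of-range values
    are clipped to the first/last bin."""
    width = (d_max - d_min) / n_bins  # 0.625 A per bin
    idx = int((distance - d_min) / width)
    return max(0, min(n_bins - 1, idx))

def is_contact(distance, cutoff=8.0):
    """Binary contact label used by the contact prediction loss."""
    return distance < cutoff
```

The one-hot target y^b_ij in the distogram loss is simply the indicator of the bin returned by `distogram_bin`.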
We use five losses to train AbNovo (base): L^(x), L^(r), L^(a), L_violation, and L_aux. Following previous works (Yim et al., 2023; Campbell et al., 2024), the denoising score matching (DSM) losses for the different modalities can be written as:

$$
\mathcal{L}^{(r)} = \frac{1}{m} \sum_{i=1}^{m} \alpha_t \big\| \nabla_r \log p_{t|0}(r^{(t)}_i \mid \hat{r}^{(0)}_i) - \nabla_r \log p_{t|0}(r^{(t)}_i \mid r^{(0)}_i) \big\|^2_2,
$$
$$
\mathcal{L}^{(x)} = \frac{1}{m} \sum_{i=1}^{m} \big\| \hat{x}^{(0)}_i - x^{(0)}_i \big\|^2_2,
$$
$$
\mathcal{L}^{(a)} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{CrossEntropy}(\hat{a}^{(0)}_i, a^{(0)}_i).
$$

We also incorporate the violation loss from AlphaFold (Jumper et al., 2021; Abramson et al., 2024) to learn the geometry of inter-residue bonds and to avoid atom clashes. Specifically:

$$
\mathcal{L}_{\text{bondlength}} = \frac{1}{N_{\text{bonds}}} \sum_{i=1}^{N_{\text{bonds}}} \max\big( |\ell^{i}_{\text{design}} - \ell^{i}_{\text{lit}}| - \tau_{\text{bondlength}},\ 0 \big),
$$
$$
\mathcal{L}_{\text{bondangle}} = \frac{1}{N_{\text{angles}}} \sum_{i=1}^{N_{\text{angles}}} \max\big( |\cos\alpha^{i}_{\text{design}} - \cos\alpha^{i}_{\text{lit}}| - \tau_{\text{bondangle}},\ 0 \big),
$$

where ℓ^i_design and cos α^i_design are the bond length and bond angle of the i-th bond/angle in the designed antibodies, and ℓ^i_lit and cos α^i_lit are the corresponding literature values. We use the same tolerance values (τ_bondlength and τ_bondangle) as AlphaFold. Additionally, we add an auxiliary loss (Yim et al., 2023): L_aux is a clamped mean squared error (MSE) loss supervising the coordinates of the designed CDR's four backbone atoms Ω = {C, Cα, N, O}:

$$
\mathcal{L}_{\text{aux}} = \frac{1}{m\,|\Omega|} \sum_{i=1}^{m} \sum_{a \in \Omega} \min\big( \| x^{(t)}_{i,a} - \hat{x}^{(t)}_{i,a} \|^2_2,\ d_{\text{clamp}} \big). \tag{33}
$$

Training losses for preference optimization.
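The flat-bottom (hinge) form of the violation losses above can be sketched as follows (illustrative; the tolerance default is a placeholder, not AlphaFold's actual constant):

```python
def flat_bottom_penalty(observed, literature, tolerance):
    """Penalize deviation from the literature value only beyond a tolerance,
    as in the bond-length/bond-angle violation losses."""
    return max(abs(observed - literature) - tolerance, 0.0)

def bond_length_violation(designed_lengths, literature_lengths, tol=0.02):
    """Mean flat-bottom penalty over all bonds (tol in Angstroms, placeholder)."""
    penalties = [flat_bottom_penalty(d, l, tol)
                 for d, l in zip(designed_lengths, literature_lengths)]
    return sum(penalties) / len(penalties)
```

The flat bottom means small deviations within the tolerance incur zero loss, so the model is not penalized for normal thermal variation around ideal bond geometry.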
From Equation 9, F(·) can be written as follows:

$$
F(\cdot) = -D_{\mathrm{KL}}\big( q^{x,r,a}(\mathcal{T}^{(t-\Delta t)}_i \mid \mathcal{T}^{(0)}_i, \mathcal{T}^{(t)}_i) \,\big\|\, p^{x,r,a}_\theta(\mathcal{T}^{(t-\Delta t)}_i \mid \mathcal{T}^{(t)}_i) \big) + D_{\mathrm{KL}}\big( q^{x,r,a}(\mathcal{T}^{(t-\Delta t)}_i \mid \mathcal{T}^{(0)}_i, \mathcal{T}^{(t)}_i) \,\big\|\, p^{x,r,a}_{\mathrm{ref}}(\mathcal{T}^{(t-\Delta t)}_i \mid \mathcal{T}^{(t)}_i) \big)
$$
$$
= -\alpha_x \big[ D_{\mathrm{KL}}( q^x \,\|\, p^x_\theta ) - D_{\mathrm{KL}}( q^x \,\|\, p^x_{\mathrm{ref}} ) \big] - \alpha_r \big[ D_{\mathrm{KL}}( q^r \,\|\, p^r_\theta ) - D_{\mathrm{KL}}( q^r \,\|\, p^r_{\mathrm{ref}} ) \big] - \alpha_a \big[ D_{\mathrm{KL}}( q^a \,\|\, p^a_\theta ) - D_{\mathrm{KL}}( q^a \,\|\, p^a_{\mathrm{ref}} ) \big], \tag{34}
$$

where x_i = [x_{n+1,i}, ..., x_{n+m,i}], r_i = [r_{n+1,i}, ..., r_{n+m,i}], and a_i = [a_{n+1,i}, ..., a_{n+m,i}]; the first subscript indexes the amino acid in the CDR and the second indexes the i-th sample in preference optimization. With some algebra, these KL divergences take the following forms.

KL divergence in R³ space. Following Wallace et al. (2024), the closed-form expression of the KL divergence in R³ simplifies to:

$$
D^{(x)}_{\mathrm{KL}} = \| f^x_\theta(x^{(t)}) - x^{(0)} \|^2_2 + C. \tag{35}
$$

KL divergence in SO(3) space. Similar to Zhou et al. (2024), we derive an approximate empirical reconstruction loss in SO(3):

$$
D^{(r)}_{\mathrm{KL}} = \| \nabla_r \log p_{t|0}(r^{(t)} \mid f^r_\theta(r^{(t)})) - \nabla_r \log p_{t|0}(r^{(t)} \mid r^{(0)}) \|^2_2 + C. \tag{36}
$$

KL divergence in discrete space. For sequences in discrete space, we give the details of the derivation:

$$
\begin{aligned}
D^{(a)}_{\mathrm{KL}} &= D_{\mathrm{KL}}\big( q^a(a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)}) \,\|\, p^a_\theta(a^{(t-\Delta t)} \mid a^{(t)}) \big) \\
&= -\mathbb{E}_{q^a(a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)})}\big[ \log p^a_\theta(a^{(t-\Delta t)} \mid a^{(t)}) \big] + C_1 \\
&= -\mathbb{E}_{q^a(a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)})}\Big[ \log \sum_{\tilde a^{(0)}} q^a(a^{(t-\Delta t)} \mid \tilde a^{(0)}, a^{(t)})\, p^a_\theta(\tilde a^{(0)} \mid a^{(t)}) \Big] + C_1 \\
&= -\mathbb{E}_{q^a(a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)})}\Big[ \log \sum_{\tilde a^{(0)}} \frac{q^a(\tilde a^{(0)} \mid a^{(t-\Delta t)})\, q^a(a^{(t-\Delta t)} \mid a^{(t)})}{q^a(\tilde a^{(0)} \mid a^{(t)})}\, p^a_\theta(\tilde a^{(0)} \mid a^{(t)}) \Big] + C_1 \\
&\le -\mathbb{E}_{q^a(a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)}),\, q^a(\tilde a^{(0)} \mid a^{(t-\Delta t)})}\Big[ \log \frac{q^a(a^{(t-\Delta t)} \mid a^{(t)})}{q^a(\tilde a^{(0)} \mid a^{(t)})}\, p^a_\theta(\tilde a^{(0)} \mid a^{(t)}) \Big] + C_1 \\
&= -\mathbb{E}_{q^a(\tilde a^{(0)}, a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)})}\big[ \log p^a_\theta(\tilde a^{(0)} \mid a^{(t)}) \big] - \mathbb{E}_{q^a(\tilde a^{(0)}, a^{(t-\Delta t)} \mid a^{(0)}, a^{(t)})}\Big[ \log \frac{q^a(a^{(t-\Delta t)} \mid a^{(t)})}{q^a(\tilde a^{(0)} \mid a^{(t)})} \Big] + C_1 \\
&= -\log p^a_\theta(a^{(0)} \mid a^{(t)}) + C. \tag{37}
\end{aligned}
$$

In practice, we choose [α^(x), α^(r), α^(a)] = [1.0, 0.5, 0.2], α^(sup) = 0.5, and K = 8.
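The K-sample reward weighting used throughout the preference-optimization losses (a softmax of rewards scaled by β, with the paper's K = 8) can be sketched as follows (illustrative only, not the released code):

```python
import math

def preference_weights(rewards, beta=1.0):
    """Softmax weights exp(R_i/beta) / sum_j exp(R_j/beta) over K sampled
    antibodies, computed with max-subtraction for numerical stability."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

Smaller β sharpens the weighting toward the highest-reward sample; larger β flattens it toward a uniform average over the K candidates.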
Meanwhile, we add a regularization term α^(R) for the 1/Δt factor in Equation 9 to ensure training stability; we choose α^(R)/Δt = 10.0.

A.4.6 TRAINING DETAILS

We report the training stages, objectives, and learning rates of AbNovo in Table 14. In the fine-tuning stage, we update the policy for 100 steps per single update of λ. The losses used when updating the policy and λ are detailed in Sections 3.3.1 and 3.3.2. We train on 8 Nvidia A100 (80 GB) GPUs with a batch size of 128 in all training stages, using the Adam optimizer with default parameters throughout.

Table 14: Hyper-parameters of AbNovo.

Stage         Training objective                                                  Steps   Learning rate   Dataset
Pre-training  L_MLM + L_distogram + L_contact                                     200k    5e-5            AFDB (2M) + PDB (filtered)
Base model    1.0L^(x) + 0.5L^(r) + 0.2L^(a) + 0.1L_violation + 1.0L_aux          20k     1e-4            Antigen-antibody complexes
Fine-tuning   L_update-policy (Equation 10)                                       20k     2e-5            Preference dataset

A.5 EVALUATION METRICS

In this section, we provide detailed descriptions of each metric.

Rosetta Binding Energy. Following previous work (Zhou et al., 2024), we calculate the binding energy between the antigen and antibody using the Rosetta software; lower energy values indicate higher antigen-antibody affinity. We first used Rosetta to perform side-chain packing on the backbone and side chains of the designed antibodies, followed by FastRelax on the side chains, and then calculated the Rosetta binding energy. We denote the residue with index i in the antibody-antigen complex as A_i, with A^sc_i and A^bb_i representing the side chain and backbone of that residue, respectively. We use E_P to represent the interaction energies between paired residues, comprising six energy types: E_hbond, E_att, E_sol, E_elec, E_lk, and E_rep.
$$
E = \sum_{i,j} \bigg[ E_{P_{\mathrm{rep}}}(A^{sc}_i, A^{sc}_j) + E_{P_{\mathrm{rep}}}(A^{sc}_i, A^{bb}_j) + 2\,E_{P_{\mathrm{rep}}}(A^{bb}_i, A^{sc}_j) + 2\,E_{P_{\mathrm{rep}}}(A^{bb}_i, A^{bb}_j) + \sum_{e \in \{\mathrm{hbond, att, elec, lk, rep}\}} \Big( E_{P_e}(A^{sc}_i, A^{sc}_j) + E_{P_e}(A^{sc}_i, A^{bb}_j) \Big) \bigg]. \tag{38}
$$

The calculation incorporates various energy terms from Rosetta, using the default ref2015 weights for each term.

Evolutionary Plausibility. Evolutionary plausibility measures how likely a designed sequence is to occur in nature, reflecting adherence to the general evolutionary rules of natural proteins. Recent studies show that large-scale protein language models, trained on millions of natural protein sequences, effectively capture these evolutionary rules (Shuai et al., 2021; Hie et al., 2024). Following previous works (Zhu et al., 2024; Zhou et al., 2024), we evaluate evolutionary plausibility by calculating the perplexity of an antibody language model (Shuai et al., 2021) on the designed antibodies. Here, a_i denotes the residue type at position i of the designed CDRs, and P is the conditional probability assigned by the model (with parameters θ) to the residue at masked position i. The specific formula is:

$$
\text{Evolutionary Plausibility} = \exp\Big( -\frac{1}{|\mathrm{CDRs}|} \sum_{i \in \mathrm{CDRs}} \log P(a_i \mid a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_N) \Big).
$$

We used the script provided by IgLM (Shuai et al., 2021) to calculate this metric.

Stability. Stability measures the conformational stability of the designed antibody in isolation, without the antigen structure involved; this differs from binding energy, which evaluates the interaction between the antibody and the antigen. Following the evaluation approach of prior work (Zhou et al., 2024; Li et al., 2024), we used Rosetta to calculate the total energy of the designed antibody as a measure of its stability. We performed side-chain packing and FastRelax on the backbone and side chains of the designed antibody, then output the total energy of the CDR regions using the Rosetta software.

Self-association.
Self-association of antibodies refers to the tendency of antibody molecules to bind to each other rather than to the antigen. Self-association can negatively impact the stability, function, and biological properties of antibodies, especially in therapeutic antibody development, where self-aggregation is generally undesirable (Yadav et al., 2012; Kanai et al., 2008). Previous studies (Makowski et al., 2024) have shown that self-association is closely correlated with the area of negatively charged patches in the CDRs. We therefore use the negatively charged patch area in the CDRs as a proxy for the risk of self-association. Although other physicochemical properties are also correlated with self-association, we used the CDR property with the highest correlation as the evaluation metric. We used the pipeline provided by previous work (Makowski et al., 2024) (https://github.com/Tessier-Lab-UMich/CST_Prop_Opt_ML) to calculate the self-association metric.

Non-specific Binding. Non-specific binding refers to the undesirable interaction of antibodies with cellular proteins other than the intended target, particularly cell-membrane proteins (Makowski et al., 2022). In practice, evaluating non-specific binding would require considering interactions with all membrane proteins as well as RNA-specific binding properties; computationally modeling these interactions for all potential targets is highly complex and challenging. We therefore followed the metric proposed in previous works (Makowski et al., 2024; 2022), which verified a correlation between non-specific binding and the hydrophobic patch area in the CDRs. Accordingly, we use the hydrophobic patch area in the CDRs as a proxy for the risk of non-specific binding.
Although other physicochemical properties are also correlated with non-specific binding, we used the most strongly correlated CDR property as the evaluation metric. We used the pipeline provided by previous work (Makowski et al., 2024) (https://github.com/Tessier-Lab-UMich/CST_Prop_Opt_ML) to calculate the non-specific binding metric.

RMSD. For DiffAb, AbX, GeoAb, and AbNovo, where the antibody framework structure is provided, we directly calculate the RMSD of the designed antibody. For dyMEAN, where no real antibody framework is given, we use Kabsch alignment to align the designed antigen-antibody complex with the natural complex, and then calculate the root mean square deviation (RMSD) for each region of the aligned complexes.

Reward and Constraints Settings. Given the stringent requirements of clinical applications, as well as the inherent biases in hydrophobicity and polarity metrics, our primary objective is to filter out antibodies that clearly violate binding rules while retaining as many potential candidates as possible. An excessively strict threshold would both eliminate many viable candidate antibodies and destabilize model training. Consequently, we heuristically determined the threshold for each constraint based on training stability and the empirical thresholds used in previous studies. Specifically, we determined the thresholds using the dataset provided by previous work (Makowski et al., 2024), which comprises 80 clinical antibodies annotated with wet-lab experimental results. For the non-specific binding and self-association metrics, we calculated the physicochemical properties of all antibodies that meet the requirements, setting the constraint-violation threshold to twice the highest property value observed among antibodies that satisfy the wet-lab metric.
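As an illustrative sketch (the function and data names are hypothetical; the rule itself is the one stated above: twice the largest property value among antibodies that pass the wet-lab metric):

```python
def constraint_threshold(property_values, passes_wetlab):
    """Set the constraint-violation threshold to 2x the largest
    physicochemical property value (e.g., hydrophobic or charged
    patch area) observed among antibodies that satisfy the
    wet-lab developability metric."""
    passing = [v for v, ok in zip(property_values, passes_wetlab) if ok]
    if not passing:
        raise ValueError("no antibody satisfies the wet-lab metric")
    return 2.0 * max(passing)
```

The factor of two deliberately loosens the bound so that borderline candidates are retained rather than filtered out, consistent with the goal stated above.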
For stability assessment, we evaluated all antibodies in RAbD and set the highest energy value as the threshold for constraint violation. In our experiments, considering the different magnitudes of the various rewards and constraints, we normalized all rewards and constraints during training. We set the initial λ to [1.0, 1.0, 1.0] and both reward weights ω1 and ω2 to 1.0.

A.6 BASELINE METHODS

DiffAb (Luo et al., 2022). We use the test script and the pre-trained model provided in the GitHub repository (https://github.com/luost26/diffab/tree/main). All hyper-parameters are default.

dyMEAN (Kong et al., 2023). We use the pre-trained model and test script provided in the GitHub repository (https://github.com/THUNLP-MT/dyMEAN). All hyper-parameters are default.

GeoAb (Lin et al., 2024). We employed GeoAb from its GitHub repository (https://github.com/Edapinenut/GeoAB) with the provided pre-trained model and all default hyper-parameters.

AbX (Zhu et al., 2024). We use the co-design test script and pre-trained model provided in the GitHub repository (https://github.com/CarbonMatrixLab/AbX). All hyper-parameters are default.

AbDiffuser (Martinkus et al., 2024), AbDPO (Zhou et al., 2024), and AbDesigner (Tan et al., 2024). These models were not available at the time of this paper's submission, and the results presented in their papers are not compatible with our experimental settings: AbDPO designs only CDR-H3, AbDiffuser does not provide results for all six CDRs designed simultaneously, and AbDesigner uses a different test set. We therefore do not include their results in our paper.