# Protein Design with Guided Discrete Diffusion

Nate Gruver¹*, Samuel Stanton²*, Nathan Frey², Tim G. J. Rudner¹, Isidro Hotzel³, Julien Lafrance-Vanasse³, Arvind Rajpal³, Kyunghyun Cho¹,², Andrew Gordon Wilson¹

*Equal contribution. ¹New York University, ²Prescient Design, Genentech, ³Antibody Engineering, Genentech.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure, with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.

1 Introduction

Optimizing protein sequences for improved function has the potential for widespread impact [63]. Among many potential applications in engineering and medicine, engineered antibodies can be used to create cancer therapeutics that are much less harmful to the patient than radiotherapy or chemotherapy. Because the space of possible proteins is vast and discrete, and experimental validation is slow and expensive, all practical methods for protein design must restrict themselves to a small enriched library of candidates to find a viable option in as few measurements as possible [44]. In practice these enriched libraries are usually obtained through massive high-throughput in vitro screening [67], or, in the case of antibodies, by injecting a live animal with the target antigen and sequencing the animal's immune response [52]. Generative protein models offer the tantalizing prospect of enriched libraries produced nearly instantly at a fraction of the cost. Success in real-world applications, however, has proven elusive, in part because naive generative models produce outputs that are similar to their training data and are therefore unlikely to improve target qualities [53].

There are many approaches to guided generation of proteins, but one broad and important distinction is between methods that search in sequence space and those that search in structure space. A basic tenet of molecular biology is that sequence determines structure, and structure determines function [9]. Hence, when optimizing a protein for a desired function, it may seem more direct to design the protein in structure space, where gradient-based sampling methods can be used in tandem with carefully engineered potentials [1, 45, 77].
[Figure 1 graphic: seed [HEAVY]/[LIGHT]/[ANTIGEN] sequences are corrupted and denoised; panels compare diffusioN Optimized Sampling (NOS) with unguided sampling, each plotted against likelihood.]
Figure 1: We propose diffusioN Optimized Sampling (NOS), a method for gradient-guided sampling from discrete diffusion models. NOS uses T sampling steps of denoising diffusion, where each step consists of applying a corruption, taking gradient steps to optimize a value function f, and sampling the next discrete sequence or corresponding latent state. NOS generates samples that optimize an arbitrary objective while maintaining high likelihood with respect to a reference distribution of sequences. We combine NOS with LaMBO, a strong Bayesian optimization method for sequence design [73], to make LaMBO-2, our improved method for protein design.

One of the drawbacks of this approach is that the optimized structure must still be converted back to an amino acid sequence in order to be synthesized, a task known as inverse folding [19]. There is no guarantee that the optimized structure can be realized by an actual sequence, and the inverse-folding procedure may not find the sequence if it exists. Structural models are also computationally intensive and limited by the scarcity of high-quality structural data.

Searching directly in sequence space eliminates the need to recover sequence from structure. Protein sequence models are also comparatively fast, especially during inference, and can leverage sequence datasets that are often several orders of magnitude larger than their structural equivalents. Although sequence models are arguably the most practical foundation for protein design, they have historically suffered from the challenges of optimizing discrete sequences, where gradient-based sampling is not directly applicable. As a result, many sequence search methods resort to gradient-free sampling methods like Metropolis-Hastings MCMC [78, 37], which are flexible but computationally expensive, eroding a key advantage over structure search. Several methods have been proposed that maintain gradient-based search by performing guidance in a continuous latent space, with a learned decoder to sample discrete sequences [33, 32]. Notably, Stanton et al. [73] proposed LaMBO (Latent Multi-Objective Bayesian Optimization), an optimization method that uses masked language model (MLM) style denoising guided with Bayesian acquisition values to address the online, multi-objective nature of real-world protein design.

While LaMBO can quickly sample sequences with improved acquisition value, it has two key limitations. First, LaMBO is built on top of MLMs which, while strong representation learners, are not strong generative models. In particular, MLMs lag behind other methods in producing high-likelihood samples or infills. Second, despite being designed to improve known sequences rather than design them completely from scratch, LaMBO and related methods have no principled framework for simultaneously enforcing an edit budget and choosing optimal edit locations based on that budget. To address the first issue we propose NOS (diffusioN Optimized Sampling), a new method for controllable categorical diffusion (Figure 1).
Diffusion models capture complex distributional dependencies by making iterative denoising steps, but there is relatively little previous work on how to control these processes. NOS generates sequences with both high likelihood and desirable qualities by taking many alternating steps of corruption, guidance, and denoising in the continuous latent space of the model. Our in silico validation shows that NOS outperforms many state-of-the-art structure- and sequence-based baselines on both unguided and guided infilling tasks.²

To address the second problem (choosing optimal edit locations) we propose using embedding-gradient feature attributions (i.e. saliency maps) to determine which positions in the sequence are most important to edit in order to improve the guidance objective value. We combine NOS with saliency-selected edits to create LaMBO-2, a more powerful variant of the original LaMBO algorithm. Exploratory in vitro experimental validation of our designs provides evidence that LaMBO-2 can be used to create enriched antibody libraries without the aid of additional in vitro high-throughput screening.

²https://github.com/ngruver/NOS

2 Related Work

Discrete diffusions. Austin et al. [4] and Hoogeboom et al. [40] constructed diffusion models for discrete categorical data using a categorical noise process. Recently, categorical diffusion has shown promise as a competitor to autoregressive models in text generation for machine translation and summarization. The approaches can be roughly grouped into methods that apply categorical noise distributions directly to sequences (CMLM [31], SUNDAE [65]) and those that apply Gaussian corruptions to token-vector embeddings (SED [74], CDCD [21]). In this work we show that NOS can guide both types of categorical diffusion. Within the space of protein design, our method is also closely related to joint diffusion models over both sequence and structure [2, 51], which also circumvent inverse folding. Because these models still rely on structure information at training time, they can be limited by data availability in the same manner as pure structure models.

Discrete generative guidance. Gradient guidance typically augments sampling from a generative model with gradient steps to increase the occurrence of a desired attribute [54]. Gradient guidance is natural within the framework of continuous diffusion models [20], and Li et al. [47] use this connection to perform gradient-guided sampling from a diffusion language model. To obtain a continuous space, they perform Gaussian diffusion [39] on word embeddings, decoding out to tokens using a linear head. The original method required many careful engineering interventions, e.g. clamping latent representations to the nearest word embedding, that have been improved by more recent methods such as CDCD [21], but gradient guidance has not been discussed for these more recent formulations. To achieve a similar form of gradient guidance without carefully engineering a latent space, Dathathri et al. [17] and Yang and Klein [83] propose gradient-guided autoregressive models that use the decoder's activations as a gradient-friendly latent space. These methods alternate between sampling from logits and ascending the likelihood of a separately trained classifier model. Surprisingly, despite work on gradient guidance for continuous-noise diffusions and autoregressive language models, there has been little work on gradient guidance for general categorical diffusions that predict denoised categorical distributions (e.g. CMLM, SUNDAE, CDCD), a topic we explore in detail.
One closely related method, proposed in the context of generative models of small molecules, is DiGress [79], which performs gradient guidance on one-hot token embeddings to bias the categorical sampling distribution of a denoising model. In our setting we show that categorical and Gaussian discrete diffusions guided with NOS outperform PPLM and DiGress (Subsec. 5.2).

Genetic algorithms. Evolutionary algorithms are a popular solution for black-box optimization in discrete spaces [3, 22]. These methods are often evaluated on their ability to optimize in silico proxy estimates of actual in vitro fitness, e.g. deep learning models trained on experimental datasets. In Subsec. 5.3 we baseline NOS against two genetic optimizers from the protein design literature, AdaLead [70] and Proximal Exploration (PEX) [61]. We show that these baselines rapidly degrade sequence likelihood as the proxy fitness is improved, limiting the effective number of edits that can be made before checking the actual fitness of the samples and ultimately limiting their sample efficiency and rate of convergence to optimal solutions. By contrast, NOS consistently improves proxy fitness while maintaining sequence likelihood.

3 Background

We pose protein design as the problem of finding sequences $w \in \mathcal{A}^L$, with alphabet $\mathcal{A}$ and fixed length $L$,³ which maximize a single objective $f(w)$ (e.g., binding affinity) or multiple objectives $f_1(w), \dots, f_k(w)$ (e.g., expression yield, binding affinity, and aggregation tendency). Designs can be generated from random noise (ab initio design) or by making a fixed number of edits $B \in \{1, \dots, L-1\}$ to a seed sequence $s \in \mathcal{A}^L$. A protein is only useful if it can be synthesized (i.e. expressed), and the objective value of non-expressing proteins is undefined since their properties cannot be measured. Therefore we must introduce the constraint $w \in E \subset \mathcal{A}^L$, where $E$ is the set of all expressible proteins. Since naturally occurring sequences must express in order to be observed, $p(w)$, the likelihood of a protein with respect to an empirical distribution of natural protein sequences, is often taken as a proxy for the tendency of a protein to express. In protein design, these proxies are typically called metrics of naturalness. Since we are looking for sequences that by definition have not yet been identified in nature, naturalness and our other objectives are often in tension.

³Length change is enabled by the use of protein sequence alignments, which introduce a gap token '-'.

[Figure 2 graphic: continuous noise process (left) vs. categorical noise process (right).]
Figure 2: Two approaches to diffusion generative modeling for categorical variables. (Left) Categorical data is embedded into continuous variables with an accompanying continuous noise process. (Right) Categorical noise is applied directly to sequences, and corrupted sequences are denoised using standard language modeling methods.

We can trade off naturalness and objective value by drawing samples from the unnormalized density
$$\bar{p}(w) = \exp(-\bar{E}(w))/Z, \qquad \bar{E}(w) = E(w) - v(w), \qquad (1)$$
where $E(w) = -\log p(w)$ is a scalar energy function and the value function $v : \mathcal{A}^L \to \mathbb{R}$ expresses the utility of a sequence with respect to our objectives. When designing proteins from primary sequence, sampling efficiently from the resulting energy function can be challenging. Simple approaches, such as the MCMC sampler used by Verkuil et al. [78], can require hundreds of thousands of steps to converge (Appendix C.2). Guided diffusion models are an appealing alternative because they construct a fixed-length Markov chain that quickly generates low-energy samples.
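To make Eq. 1 concrete, the following is a minimal sketch of the kind of Metropolis-Hastings sampler referred to above; `log_likelihood` and `value` are hypothetical stand-ins for a sequence likelihood model and a fitness proxy, not the samplers or models used in the cited work.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def energy(seq, log_likelihood, value):
    # Eq. 1: E_bar(w) = -log p(w) - v(w); lower energy means more natural and higher utility.
    return -log_likelihood(seq) - value(seq)

def metropolis_hastings(seed, log_likelihood, value, n_steps=100_000):
    """Propose random single-site mutations and accept them with the Metropolis criterion."""
    current = seed
    e_curr = energy(current, log_likelihood, value)
    for _ in range(n_steps):
        i = random.randrange(len(current))
        proposal = current[:i] + random.choice(ALPHABET) + current[i + 1:]
        e_prop = energy(proposal, log_likelihood, value)
        # Always accept downhill moves; accept uphill moves with prob exp(-(E_prop - E_curr)).
        if e_prop < e_curr or random.random() < math.exp(e_curr - e_prop):
            current, e_curr = proposal, e_prop
    return current
```

The point of the comparison in the text is that a chain of this kind may need a very large number of such steps before the energy in Eq. 1 is meaningfully reduced.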
Diffusion models. Denoising diffusion models construct samples by reversing a diffusion process that maps clean data points $x_0$ to samples from a prior $\pi(x)$ (Figure 2). The forward process ($x_0 \to x_T$) is composed of conditional distributions $p(x_t \mid x_{t-1})$ (i.e., the noise process) that admit closed forms for the conditional distributions $p(x_t \mid x_0)$ and $p(x_{t-1} \mid x_t, x_0)$ (e.g., independent Gaussian corruption). The reverse process ($x_T \to x_0$) converts samples from the prior into samples from the learned data distribution $p_\theta(x_0)$ by repeatedly predicting the denoised variable $\hat{x}_0$ from noisy values $x_t$ and using the conditional distribution $p(x_{t-1} \mid x_t, \hat{x}_0)$ to derive a transition distribution $p_\theta(x_{t-1} \mid x_t)$. The specific choice of noise process has been shown to significantly affect the likelihood and quality of image samples [71]. For categorical data there are two common approaches to constructing a diffusion generative model, depending on the nature of the noise process. We include brief descriptions below and a more detailed account in Appendix A.

Continuous noise. To learn a distribution $p(w)$, one strategy is to first embed $w$ into a continuous variable $x_0$ with embedding matrix $U_\theta$ and apply Gaussian noise [21]. The prior is taken to be $\pi(x) = \mathcal{N}(0, I)$, while the forward process is $p(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_0, (1-\alpha_t) I)$ for $\alpha_t \in [0, 1]$. The values of $\alpha_t$ are determined by a user-specified corruption schedule. For the reverse process, we learn a function $p_\theta(\hat{w} \mid x_t, t)$ to predict the sequence from noised points $x_t$ by minimizing the following loss with respect to $\theta$:
$$\mathcal{L}(\theta) = \mathbb{E}_{w_0, t}\left[-\log p_\theta(w_0 \mid x_t)\right], \qquad x_t \sim p(x_t \mid x_0 = U_\theta w_0).$$
Using $p_\theta(\hat{w} \mid x_t, t)$ we can construct a distribution for the reverse process
$$p_\theta(x_{t-1} \mid x_t) = \sum_{\hat{w}} p(x_{t-1} \mid x_t, \hat{x}_0 = U_\theta \hat{w})\, p_\theta(\hat{w} \mid x_t, t), \qquad (2)$$
where $p(x_{t-1} \mid x_t, x_0)$ is also a Gaussian distribution. At inference time, we can use the learned reverse process to convert samples from $\pi(x)$ into samples from the learned distribution $p_\theta(x_0)$ by repeatedly sampling $p_\theta(x_{t-1} \mid x_t)$, followed by sampling $w \sim p_\theta(\hat{w} \mid x_0, 0)$.

Categorical noise. Alternatively, Austin et al. [4] proposed a forward process which operates directly on $w$ by introducing an absorbing state for each token, $w^{(i)} = \texttt{[MASK]}$. The forward process ($w_0 \to w_T$) is defined by a discrete transition matrix describing the probability of mutating a token into a [MASK], and the corresponding prior is simply a point mass on the sequence of all [MASK] tokens. To learn the parameters of the denoiser $p_\theta(\hat{w}_0 \mid w_t, t)$ we maximize the likelihood of the denoising process on ground-truth sequences:
$$\mathcal{L}(\theta) = \mathbb{E}_{w_0, t}\left[-\log p_\theta(w_0 \mid w_t)\right], \qquad w_t \sim p(w_t \mid w_0).$$
Then, as above, we can use the denoiser to construct the reverse process
$$p_\theta(w_{t-1} \mid w_t) = \sum_{\hat{w}_0} p(w_{t-1} \mid w_t, \hat{w}_0)\, p_\theta(\hat{w}_0 \mid w_t, t), \qquad (3)$$
where $p(w_{t-1} \mid w_t, w_0)$ is also a categorical distribution derived using Bayes' rule. To sample, the transition distribution is applied for $t = [T, \dots, 0]$.

We now present practical methods for efficiently sampling from $\bar{p}(w) \propto p(w)\exp(v(w))$ (Eq. 1) by modifying the learned transition distribution with a learned value function $v_\theta(w)$. We then show how this sampling method can be used to perform protein design through guided infilling in sequence space. As before, we provide the most salient information below and the full details in Appendix B.
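As a concrete illustration of the masking (absorbing-state) diffusion above, the sketch below implements the forward corruption and the reverse sampling loop; the `denoiser` callable, token indices, and schedule are hypothetical placeholders rather than the paper's actual model.

```python
import torch

MASK_ID = 20  # assumed index of the [MASK] absorbing state in a 21-token vocabulary

def corrupt(w0, mask_prob):
    """Forward process: independently replace each token with [MASK] with probability mask_prob."""
    masked = torch.rand(w0.shape) < mask_prob
    return torch.where(masked, torch.full_like(w0, MASK_ID), w0)

@torch.no_grad()
def reverse_sample(denoiser, length, T, unmasked_prob):
    """Reverse process: start from the all-[MASK] prior and progressively reveal tokens.

    unmasked_prob[t] is the probability a token is still unmasked after t forward steps,
    with unmasked_prob[0] = 1 (clean data) and unmasked_prob[T] ~ 0 (fully masked prior).
    """
    wt = torch.full((1, length), MASK_ID)
    for t in range(T, 0, -1):
        logits = denoiser(wt, t)                                    # p_theta(w0_hat | w_t, t)
        w0_hat = torch.distributions.Categorical(logits=logits).sample()
        # Posterior over w_{t-1}: a still-masked token is revealed with probability
        # (unmasked_prob[t-1] - unmasked_prob[t]) / (1 - unmasked_prob[t]).
        reveal_prob = (unmasked_prob[t - 1] - unmasked_prob[t]) / (1.0 - unmasked_prob[t])
        reveal = (wt == MASK_ID) & (torch.rand(1, length) < reveal_prob)
        wt = torch.where(reveal, w0_hat, wt)
    return wt
```

Training pairs for the denoiser come from `corrupt`, matching the categorical loss above; NOS (next section) changes only how the denoiser's logits are produced at each reverse step.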
4.1 NOS: diffusioN Optimized Sampling

We introduce a general form of gradient guidance (NOS) for discrete diffusions with categorical denoising models, i.e. diffusion models that predict logits over the ground-truth tokens (e.g. [21, 4]). The key challenge in applying gradient guidance to categorical data is simply the lack of a continuous representation. Fortunately, in any denoising network, e.g. $p_\theta(\hat{w} \mid x_t, t)$, the discrete sequence $w_t$ has many corresponding continuous representations in the form of hidden states of the model, $h_t = g_d(w_t)$ for $d \in \{0, \dots, D\}$, where $D$ is the depth of the encoder network and $g_0(w_t) = U_\theta w_t$. Notably, for the Gaussian diffusion models in Sec. 3 we can equivalently take $x_t = g_0(w_t)$, as corruption and sampling are performed on the learned token embeddings. In the case of the categorical noise diffusion, $p_\theta(\hat{w}_0 \mid w_t) = p_\theta(\hat{w}_0 \mid h_t)$, and thus for the purpose of guidance we can consider a general $p_\theta(\hat{w} \mid h_t)$ for both forms of corruption.

To sample from $\bar{p}_\theta(w_t) \propto p_\theta(w_t)\exp(v_\theta(w_t))$, we construct a modified denoising model $\bar{p}_\theta(\hat{w} \mid h_t) \propto p_\theta(\hat{w} \mid h_t)\exp(v_\theta(h_t))$. This formulation requires that the denoising model and the value function share hidden states up to depth $d$, and that the value function also be trained on corrupted inputs $w_t$. In Appendix D.4 we propose a simple procedure for corrupted discriminative training inspired by label smoothing [76]. Using this modified denoising model we can construct modified transition distributions using Eq. 2 or Eq. 3. There is one key difference between these transition distributions: in the continuous case (Eq. 2), smooth steps are taken in the token embedding space, while in the discrete case (Eq. 3) the transition leads to large jumps from one token embedding to another. In either case, it is possible to sample a discrete sequence $w$ at any point in the chain using the logits of the denoiser $p_\theta(\hat{w} \mid h_t)$. When using Eq. 2 to derive a continuous transition distribution we call the method NOS-C, and when using Eq. 3 for discrete transitions we call the method NOS-D.

To sample from the modified transition distribution at each diffusion step, we use Langevin dynamics with temperature $\tau > 0$ and the update step
$$h_t' \leftarrow h_t' - \eta\, \nabla_{h_t'}\!\left[\lambda\, \mathrm{KL}\!\left(p_\theta(\hat{w} \mid h_t') \,\|\, p_\theta(\hat{w} \mid h_t)\right) - v_\theta(h_t')\right] + \sqrt{2\eta\tau}\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad (4)$$
where $\eta$ is the step size and $\lambda$ is the regularization strength, followed by sampling $p_\theta(w_{t-1} \mid h_t')$ or $p_\theta(h_{t-1} \mid h_t')$. While the gradient $\nabla_h v_\theta$ guides towards high values of the objective, the KL term ensures the resulting transition distribution still maximizes the likelihood of the original prediction.

NOS is related to the popular plug-and-play language model (PPLM), which can be used for gradient guidance of autoregressive language models [17]. PPLM guides sampling by taking gradient steps similar to Eq. 4 at each autoregressive decoding step (details in Appendix B). Unlike PPLM, NOS is a form of iterative refinement, meaning that tokens across the entire sequence can be modified at each optimization step. This distinction is particularly important for protein design, because function can be determined by complex interactions between distant parts of the sequence. As we see in Sec. 5, NOS leads to better trade-offs between likelihood and objective value.
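The sketch below illustrates the hidden-state Langevin update of Eq. 4 for a single diffusion step; `encoder`, `denoiser_head`, and `value_fn` are hypothetical stand-ins for the shared trunk, language-model head, and value head described above, and the hyperparameter values are arbitrary.

```python
import torch
import torch.nn.functional as F

def nos_guidance_step(encoder, denoiser_head, value_fn, wt,
                      n_langevin=10, eta=1e-2, lam=1.0, tau=1e-2):
    """One NOS step: Langevin updates on hidden states (Eq. 4), returning guided logits."""
    with torch.no_grad():
        h_orig = encoder(wt)                       # h_t, the unguided hidden states
        logits_orig = denoiser_head(h_orig)        # p_theta(w_hat | h_t)
    h = h_orig.clone().requires_grad_(True)
    for _ in range(n_langevin):
        logits = denoiser_head(h)
        # KL( p_theta(. | h') || p_theta(. | h) ), computed from the two sets of logits.
        kl = F.kl_div(F.log_softmax(logits_orig, dim=-1),
                      F.log_softmax(logits, dim=-1),
                      log_target=True, reduction="batchmean")
        loss = lam * kl - value_fn(h)              # lambda * KL - v(h'); value_fn returns a scalar
        grad, = torch.autograd.grad(loss, h)
        with torch.no_grad():
            h -= eta * grad
            h += (2 * eta * tau) ** 0.5 * torch.randn_like(h)
    return denoiser_head(h).detach()               # guided logits used to sample w_{t-1}
```

The guided logits (or the updated embeddings, for the continuous variant) are then used to sample the next state via Eq. 3 or Eq. 2, as described above.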
4.2 LaMBO-2: function-guided protein design

Many unique challenges arise when applying guided diffusion to real-world protein design tasks. Our approach builds on the LaMBO-1 algorithm proposed by Stanton et al. [73], which explicitly accounts for the online, multi-objective nature of protein design by optimizing a multi-objective Bayesian acquisition function. LaMBO-2 replaces the guided MLM sampler with NOS, selects edit positions based on acquisition value saliency, and replaces the discriminative deep kernel Gaussian process (GP) with ensemble-based uncertainty quantification.

[Figure 3 graphic: per-position saliency over the hu4D5 variable heavy (VH) sequence, with CDRs boxed.]
Figure 3: An example of a binding-affinity saliency map produced by LaMBO-2 with NOS-D. For simplicity, only the variable heavy (VH) region of the hu4D5 antibody is shown. Positions corresponding to complementarity-determining regions (CDRs) are enclosed in green boxes. Converting this saliency map to an edit-position distribution will concentrate computational resources on editing CDRH3, which is often manually selected by experts. Some resources are also allocated to the framework and other CDRs since these positions may also affect binding.

Architecture and value function. In order to apply the methods discussed in Subsec. 4.1 we require a generative diffusion model $p_\theta(w)$ and a discriminator $\hat{f}_\theta(w)$ which share hidden layers up to depth $d$. The discriminator is trained to predict the objective function $f$. Like LaMBO-1, our architecture consists of many task-specific feature extraction layers that share a bidirectional encoder. Bayesian optimization is an iterative cycle of inference and data acquisition. During the data acquisition phase of any iteration $i$ we need to find sequences with maximal acquisition value $v_i(w) = \mathbb{E}[u(w, f, \mathcal{D}_i)]$, where $\mathcal{D}_i$ is the data already available, the expectation is taken with respect to a posterior $p_\theta(f \mid \mathcal{D}_i)$, and $u$ is some utility function. For multi-objective tasks $u$ is the hypervolume improvement utility function [18]; we note that single-objective tasks are easily accommodated by a different choice of utility function [82]. To estimate the expectation we draw samples from $p_\theta(f \mid \mathcal{D})$ with an approach we call partial deep ensembles, where the discriminative layers of the model above the shared encoder are replicated $k$ times and randomly initialized [81]. We provide further details about partial deep ensembles and our learned discriminators in Appendices D.2 and D.3.

Choosing edit positions. When $B \ll L$, encoder-only architectures allow very precise control of edit positions, since we will only change positions we corrupt. However, this feature introduces the need for some procedure to choose those positions, ideally where edits will most improve our objective value. We automatically select edit positions by computing the gradient of the value function with respect to $h_0$ to determine which positions affect the value estimate the most (see Figure 3 for an illustration). This method is related to the use of saliency maps to explain the decisions of classifiers [5, 69]. We use input saliency to induce a distribution over edit positions. Specifically, given an embedded sequence $h_0$ we define $s_i(h_0)$, the saliency with respect to $v$ of position $i \in \{1, \dots, L\}$, as
$$s_i(h_0) := \max\left\{\Big(\textstyle\sum_j \big|(\nabla_h v(h_0))_{ij}\big|\Big)^{1/\tau},\ \varepsilon\right\}, \qquad \Pr\big[\text{edit } w_0^{(i)}\big] = \frac{s_i(h_0)}{\sum_j s_j(h_0)}, \qquad (5)$$
where $\tau > 0$ is a temperature hyperparameter and $0 < \varepsilon \ll 1$. As $\tau \to \infty$, $\Pr[\text{edit } w_0^{(i)}] \to 1/L$ for all $i$. For each sequence we draw $B$ edit positions without replacement according to Eq. 5. We conserve parts of the input we cannot change (e.g. the antigen sequence) by setting their saliency to 0 before computing the edit-position distribution. Importantly, the diffusion sampling process can also preserve the original values of selected positions when appropriate. If we select a highly conserved position, then the predicted logits will be low entropy and the guidance will incur a large KL penalty for changes (Eq. 4).
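A minimal sketch of the saliency-based edit-position selection in Eq. 5, assuming a `value_fn` that maps token embeddings to a scalar acquisition value; the names and shapes are illustrative rather than taken from the released code.

```python
import torch

def select_edit_positions(value_fn, h0, budget, tau=1.0, eps=1e-6, frozen=None):
    """Sample `budget` edit positions without replacement from the saliency distribution (Eq. 5).

    h0:     (L, d) embedded sequence
    frozen: (L,) boolean mask of positions that must not be edited (e.g. the antigen)
    """
    h0 = h0.detach().requires_grad_(True)
    value = value_fn(h0)                               # scalar acquisition value v(h0)
    grad, = torch.autograd.grad(value, h0)             # (L, d) gradient wrt embeddings
    saliency = grad.abs().sum(dim=-1).pow(1.0 / tau)   # (sum_j |grad_ij|)^(1/tau)
    saliency = saliency.clamp_min(eps)
    if frozen is not None:
        saliency = saliency.masked_fill(frozen, 0.0)   # conserved positions get zero mass
    probs = saliency / saliency.sum()
    return torch.multinomial(probs, budget, replacement=False)
```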
5 Experiments

We evaluate our methods on three increasingly complex antibody design tasks. First, we compare our trained diffusion models on unguided infilling tasks, showing that sequence diffusion methods consistently outperform structure-based methods when only predicted structures are available.⁴ We then evaluate NOS by optimizing two objectives that can be simulated effectively in silico. Lastly, we evaluate LaMBO-2 on antibody lead optimization, with both in silico and in vitro experiments.

[Figure 4 graphic: sequence recovery by infill region (HCDR1, HCDR2, HCDR1+HCDR2+HCDR3, LCDR3) for RFDiffusion, DiffAb, IgLM, NOS-C (ours), and NOS-D (ours).]
Figure 4: We infill antibody CDRs with discrete diffusion models (ours) and compare against structure-based diffusion models (DiffAb [51] and RFDiffusion [80]) and an autoregressive antibody language model (IgLM [68]). Diffusion on sequences alone, without structural priors, reliably leads to high sequence recovery. For structure-based methods, we first fold seed sequences with IgFold [64] and then run joint sampling of sequence and structure for the CDR. We sample 10 infills for each of the 10 antibody seed sequences selected randomly from paired OAS [56].

5.1 Unguided antibody CDR infilling

We focus on immunoglobulin G (IgG) format antibodies, which are comprised of a heavy (H) chain and a light (L) chain. Each chain has three complementarity-determining regions (CDRs), which tend to have strong effects on binding affinity to a target antigen but limited effects on other structural properties of the protein. Antibody design methods traditionally focus on proposing mutations to CDRs while leaving the rest of the protein fixed, which can be viewed as an infilling task. We select 10 seeds at random from paired OAS [56] and infill each CDR individually as well as in combination. To evaluate the performance of each model, we measure the sequence recovery rate, which is simply the accuracy of the infilling predictions given the ground-truth sequence. As baselines, we include IgLM [68], a GPT-2 language model trained on OAS, and two structure-based methods: DiffAb [51], a joint sequence-structure diffusion model trained on SAbDab, and RFDiffusion [80], a structural diffusion model trained on the PDB [10] that uses inverse folding to derive sequences. Although IgLM is trained with fill-in-the-middle augmentations [7], it does not natively support infilling multiple non-contiguous regions; we enable this by replacing regions that are not yet sampled with [UNK] tokens. For the structure-based methods, we provide starting structures generated with IgFold [64].

In Figure 4, we find that diffusion models often generate infills that are on par with or better than those returned by IgLM, especially when multiple regions must be filled simultaneously. We also see that DiffAb, while capable of sequence-structure co-design out of the box, often underperforms sequence-only diffusion, most likely because our sequence-based approaches have access to a larger training dataset, while paired datasets with sequences and structures are much more limited. Lastly, RFDiffusion tends to generate relatively low-likelihood CDR infills. The gap between DiffAb and RFDiffusion may be explained by the relative scarcity of antibody structures in the PDB compared to SAbDab, which has an antibody in every structure. The poor performance of structural methods on CDR infilling could also be a result of poor sequence recovery from structure during inverse folding, a problem that could be amplified for relatively unstructured loop regions like CDRs.
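The sequence recovery metric used above is simply per-position accuracy over the infilled region; a minimal sketch, with an illustrative function name:

```python
def sequence_recovery(predicted: str, ground_truth: str, infill_positions: list[int]) -> float:
    """Fraction of infilled positions whose predicted residue matches the ground truth."""
    matches = sum(predicted[i] == ground_truth[i] for i in infill_positions)
    return matches / len(infill_positions)
```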
5.2 Optimizing antibodies for in silico objectives

To test guided sampling from our model, we run experiments on two simple single-objective tasks:
- The percentage of beta sheets, measured on primary sequence [15]
- The solvent-accessible surface area (SASA) of the protein's predicted structure [64]

⁴In practical protein design campaigns it is infeasible to get ground-truth structural measurements for proposed designs, and predicted structures are the only alternative available.

[Figure 5 graphic: % beta sheets vs. ProtGPT2 log-likelihood for seeds, NOS-D (ours), NOS-C (ours), RFDiffusion, DiffAb, PPLM, and DiGress.]
Figure 5: Comparing samples from NOS (ours) with alternative guided generation methods and structure-based models. NOS exhibits higher likelihood for similar or dramatically improved values of the objective. (Left) Sequence diversification (resampling and selecting improved points) with DiffAb [51] or RFDiffusion [80]. DiffAb generates sequences and structures simultaneously, while sequences for RFDiffusion are obtained using ProteinMPNN [19]. Compared with NOS, these methods do not effectively optimize the objective and yield low-likelihood sequences. (Right) Guided generation using PPLM [17], a guidance method for autoregressive language models (in this case IgLM [68]), and DiGress [79], a competing guidance method for discrete diffusion models. NOS, PPLM, and DiGress are sampled for many settings of guidance strength (e.g. η and λ in Eq. 4) to demonstrate the full range of trade-offs between the objective and likelihood. We provide details about hyperparameter settings in Appendix C.5 and additional density plots in Appendix C.6.

Since we want plausibly natural antibodies with high objective value, we examine the Pareto front for samples optimized for each objective, with log-likelihood assigned by ProtGPT2 [29] (trained on UniRef50 [75]) plotted against the value of the objective. As an autoregressive guided baseline, we run PPLM, using IgLM as the base generative model (details in Appendix C.3). We use DiGress [79] as a guided diffusion baseline. DiGress uses gradients on one-hot representations to perform guidance in the embedding layer and is thus closely related to our approach. We discuss differences between the methods and the details of our DiGress experiments in Appendix C.5. For PPLM, DiGress, and NOS, we generate samples for many different settings of the control hyperparameters (Section 4.1), which yields samples across the spectrum from aggressively optimizing the objective to conservatively maintaining high likelihood. We also include DiffAb and RFDiffusion without guidance as baselines, as examples of popular diversification procedures in which new samples are generated for later ranking and filtering. In Figure 5, we see that for both continuous and discrete corruptions NOS offers better trade-offs between optimizing the objective and maintaining high likelihood, while also generating high values of the objective at the extreme.
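For reference, the beta-sheet objective above can be estimated directly from the primary sequence with Biopython [15]; the sketch below is a plausible implementation of such a sequence-level objective (assuming a gap-free amino acid string), not necessarily the exact one used in the experiments.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def beta_sheet_fraction(sequence: str) -> float:
    """Composition-based estimate of the fraction of residues favoring beta sheets."""
    helix, turn, sheet = ProteinAnalysis(sequence).secondary_structure_fraction()
    return sheet
```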
5.3 Antibody lead optimization: in silico evaluation

Having established the performance of NOS on simpler benchmarks, we now turn to real-world antibody design with LaMBO-2. From this point forward, in all experiments we jointly condition on the heavy chain, light chain, and antigen sequence, and we optimize only the heavy and light chains, for improved expression yield and binding affinity to the antigen. As we discussed in Subsec. 4.2, one of the critical subproblems in Bayesian optimization is the identification of high-value additions to the existing dataset. In this section we show that LaMBO-2 effectively applies NOS to this subproblem in the antibody design setting by finding high-acquisition-value sequences while preserving naturalness (which we quantify with the metric proposed by Shanehsazzadeh et al. [67]). We focus on optimizing hu4D5, a therapeutic antibody targeting the HER2 antigen.⁵

⁵HER2 is an important target for certain types of breast and stomach cancer [36].

Comparison with genetic algorithms. We first compare LaMBO-2 to two discrete optimization baselines, AdaLead and PEX. We generated 32 designs from the hu4D5 seed with each method, optimizing the same acquisition function derived from the same model. To ensure a fair comparison, we limited all methods to a total of 512 model evaluations and a maximum of 2 edits per sampling iteration. We evaluated both the sample acquisition value and naturalness after each iteration. We identified an empirical naturalness threshold below which expression became unreliable and treated this threshold as a simple inequality constraint. Note that this experiment evaluates how each method balances the trade-off between acquisition value and naturalness as sampling progresses, and does not involve evaluations of the actual black-box objective.

[Figure 6 graphic: (left) naturalness and best acquisition value vs. number of sampling steps for AdaLead, LaMBO-2, PEX, and WJS; (right) objective value vs. diffusion step for uniform vs. salient edits, with and without guidance.]
Figure 6: (Left) Naturalness constraints present a challenge for genetic methods, which rapidly decline in naturalness even as their objective value improves. The grey dashed line is an empirical lower bound on naturalness above which in vitro expression is reliable. Although AdaLead and PEX both improve the acquisition value, they quickly leave well-supported areas of the search space (drop below the dashed line), shown by the faded section of each curve. By contrast, the naturalness of LaMBO-2 samples degrades much more slowly while the acquisition value consistently improves. (Right) Ablating the effects of guidance and edit position selection. We start with the hu4D5 HER2 antibody and vary the edit budget $B \in \{8, 32\}$, optimizing for expression yield and binding affinity. For all choices of edit budget, we find that the effect size of edit position selection is much larger than that of guidance, making salient unguided edits a surprisingly strong baseline.

LaMBO-2 strictly dominates PEX in terms of naturalness and acquisition value at every sampling iteration, with PEX producing infeasible samples beyond 4 iterations. AdaLead improves sample value the most rapidly of all methods in this experiment, but also degrades naturalness the fastest, violating the constraint after only 2 sampling iterations.
In contrast, LaMBO-2 samples satisfy the naturalness constraint out to 16 sampling steps, producing the highest-value feasible solutions. This result highlights the importance of accounting for distributional constraints when optimizing empirical proxies of fitness, since the quality of the proxy signal degrades rapidly outside the support of the training data. Genetic algorithms easily hack empirical models by leaving the support of natural sequences, where training data is necessarily absent, leading to poor-quality solutions that nevertheless attain high acquisition value. In Appendix D.5 we show that both sequence- and structure-based unguided infilling (i.e. random hit diversification) has the opposite behavior, producing samples with reasonable naturalness but low acquisition value.

Effect of salient edits. To separate and independently study the effects of guidance (NOS) and salient position selection, we present an ablation in Figure 6 for optimization with a relatively small edit budget $B$ ($B <$ 10% of mutable positions). To isolate the effect of salient edits we baseline against edit positions chosen uniformly at random, and to isolate the effect of guidance we set the step size $\eta$ (Eq. 4) to 0. Small edit-distance constraints are common in antibody engineering because the goal is typically to increase binding affinity without altering the binding location on the antigen (i.e. the engineered antibody should bind to the same epitope) [43]. One heuristic way to constrain the design to the same epitope is to set $B = 8$ (about 2.7% of the antibody sequence length) [43], precisely the range we consider in Figure 6. In the few-edit regime we find that, while both interventions improve sample objective value, selecting positions using saliency has a much larger effect than guidance. Although gradient guidance is a reliable and generally applicable tool for improved sampling, the scale of the edit-position search dominates the scale of the search over token replacements that guidance affects. With a vocabulary of 21 tokens, the number of possible token combinations ($21^8$) is dwarfed by the number of possible combinations of edit positions ($\binom{300}{8}$). Salient selection of edit positions is therefore key to any practical application of NOS in budget-constrained design. Interestingly, this facet of protein design differs significantly from guided sampling of images, where generation is typically limited to fixed locations [50, 14], not a fixed edit budget spread over whichever locations will optimize the objective. These additional degrees of freedom pose an extra challenge.
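A quick sanity check of the scales quoted above (vocabulary of 21 tokens, edit budget of 8, roughly 300 mutable positions):

```python
import math

token_combinations = 21 ** 8               # replacements once 8 positions are fixed: ~3.8e10
position_combinations = math.comb(300, 8)  # ways to choose which 8 positions to edit: ~1.5e15

print(f"{token_combinations:.1e} token combinations vs {position_combinations:.1e} position choices")
```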
5.4 Antibody lead optimization: in vitro evaluation

As our final evaluation, in Figure 7 we present results using LaMBO-2 (specifically with NOS-D) to optimize 20 seed antibodies distributed across 4 different therapeutic target antigens, including hu4D5/HER2.⁶ We tested 374 designs in total over three rounds, retraining the model after each round and varying a range of design choices and hyperparameters. While expression and binding performance varied from round to round across seeds and targets, by the final round we were able to generate multiple submicromolar binders for all 4 targets with a median of 5 edits to the seed. See Appendix D.7 for individual yield and affinity measurements and experimental details.

[Figure 7 graphic: expression rate, fraction with improved yield, binding rate, and fraction with improved binding across Rounds 1–3, with AbSci (2023) shown for context.]
Figure 7: We use LaMBO-2 to optimize 20 seed antibodies for 4 different target antigens over three experimental rounds, retraining the model after each round. Some design choices and hyperparameters varied from round to round, with substantial impact on the results. In the last round we tested 56 antibodies and attained a 99% expression rate and 40% binding rate on average across targets. On average, 43% of the expressing designs had higher yield and 21% of binding designs had higher binding affinity than the corresponding seed. These results are very encouraging when placed in context with a related experiment designing HER2 antibody libraries [67]. Our results provide evidence that enriched antibody libraries can be created in silico without the assistance of high-throughput in vitro screening.

The improvements to yield and affinity over time can be attributed both to the data added to the training corpus and to methodological insights gleaned after each round. For example, the sharp drop in expression in Round 2 can mainly be attributed to framework residue deletions that arose when $\lambda$ (the KL penalty coefficient) was set too small. In the following round we tried a range of larger $\lambda$ values and fixed the sequence lengths, and expression immediately recovered. Figure 7 also reports binding affinity results from a related experiment by Shanehsazzadeh et al. [67] for context, though we emphasize that there are substantial differences between our wet-lab validation and theirs which prevent a true apples-to-apples comparison. In the latter experiment, 1M designs were generated for the HER2 target and screened with a high-throughput assay. After screening, 421 designs were validated with a high-fidelity surface plasmon resonance (SPR) assay. In addition to wet-lab screening, their experiment also restricted edits to specific antibody CDRs. We optimized antibodies for a range of targets including HER2 and relied exclusively on in silico screening before validating with SPR, while placing no explicit restrictions on the edit locations. Despite these differences, our results provide initial evidence that it is possible to generate enriched libraries of antibody designs exclusively with in silico methods operating only on primary sequence. While the experimental validation provided is preliminary, we are actively pursuing more rigorous experimental testing in the form of up-scaled and repeated expression and binding experiments and specificity assessment.

6 Discussion

There are many exciting directions for future work. The original LaMBO algorithm was used to optimize small molecules in addition to proteins, and applying LaMBO-2 to small-molecule design is a fruitful direction, as LaMBO-2's improvements are not protein-specific. While sequence alignments are a convenient solution to the length-change problem in protein design, padding methods [47] or diffusion with a variable-length corruption process (e.g. [60]) will be needed for applications like small molecules which do not admit alignments. We are also eager to consider optimizing much longer sequences, such as gene perturbations [42], which can exceed 20K tokens in length and may necessitate the use of recent advancements such as implicit convolutions [35, 57, 59] or clever modifications of self-attention [16, 13, 48].
More general notions of guidance such as classifier-free guidance [38] for text or class-conditional generation [62, 12] are another intriguing direction, since some goals are difficult to express as black-box functions or constraints [49, 58]. 6Due to the sensitive nature of the data, we do not disclose the other seeds or drug targets. [1] Rebecca F Alford, Andrew Leaver-Fay, Jeliazko R Jeliazkov, Matthew J O Meara, Frank P Di Maio, Hahnbeom Park, Maxim V Shapovalov, P Douglas Renfrew, Vikram K Mulligan, Kalli Kappel, et al. The rosetta all-atom energy function for macromolecular modeling and design. Journal of chemical theory and computation, 13(6):3031 3048, 2017. [2] Namrata Anand and Tudor Achim. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. ar Xiv preprint ar Xiv:2205.15019, 2022. [3] Charles Audet. A survey on direct search methods for blackbox optimization and their applications. Springer, 2014. [4] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981 17993, 2021. [5] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11:1803 1831, 2010. [6] Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. " will you find these shortcuts?" a protocol for evaluating the faithfulness of input salience methods for text classification. ar Xiv preprint ar Xiv:2111.07367, 2021. [7] Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine Mc Leavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. ar Xiv preprint ar Xiv:2207.14255, 2022. [8] Prajjwal Bhargava, Aleksandr Drozd, and Anna Rogers. Generalization in nli: Ways (not) to go beyond simple heuristics, 2021. [9] Carl Ivar Branden and John Tooze. Introduction to protein structure. Garland Science, 2012. [10] Stephen K Burley, Helen M Berman, Gerard J Kleywegt, John L Markley, Haruki Nakamura, and Sameer Velankar. Protein data bank (pdb): the single global macromolecular structure archive. Protein crystallography: methods and protocols, pages 627 641, 2017. [11] Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, and Dylan Hadfield-Menell. Benchmarking interpretability tools for deep neural networks. ar Xiv preprint ar Xiv:2302.10894, 2023. [12] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. ar Xiv preprint ar Xiv:2301.00704, 2023. [13] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. ar Xiv preprint ar Xiv:1904.10509, 2019. [14] Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. ar Xiv preprint ar Xiv:2206.00941, 2022. [15] Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422 1423, 2009. 
[16] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344 16359, 2022. [17] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. ar Xiv preprint ar Xiv:1912.02164, 2019. [18] Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable Expected Hypervolume Improvement for Parallel Multi-Objective Bayesian Optimization. Advances in Neural Information Processing Systems, 33, 2020. [19] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning based protein sequence design using proteinmpnn. Science, 378(6615):49 56, 2022. [20] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780 8794, 2021. [21] Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. ar Xiv preprint ar Xiv:2211.15089, 2022. [22] Benjamin Doerr and Frank Neumann. Theory of evolutionary computation: Recent developments in discrete optimization. 2019. [23] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011. [24] James Dunbar and Charlotte M Deane. Anarci: antigen receptor numbering and receptor classification. Bioinformatics, 32(2):298 300, 2016. [25] James Dunbar, Konrad Krawczyk, Jinwoo Leem, Terry Baker, Angelika Fuchs, Guy Georges, Jiye Shi, and Charlotte M Deane. Sabdab: the structural antibody database. Nucleic acids research, 42(D1):D1140 D1146, 2014. [26] Patrick Emami, Aidan Perreault, Jeffrey Law, David Biagioni, and Peter St John. Plug & play directed evolution of proteins with gradient-based discrete mcmc. Machine Learning: Science and Technology, 4(2):025014, 2023. [27] Michael Emmerich. Single-and multi-objective evolutionary design optimization assisted by gaussian random field metamodels. dissertation, Universität Dortmund, 2005. [28] Michael TM Emmerich, André H Deutz, and Jan Willem Klinkenberg. Hypervolume-based expected improvement: Monotonicity properties and exact computation. In 2011 IEEE Congress of Evolutionary Computation (CEC), pages 2147 2154. IEEE, 2011. [29] Noelia Ferruz, Steffen Schmidt, and Birte Höcker. Protgpt2 is a deep unsupervised language model for protein design. Nature communications, 13(1):1 10, 2022. [30] Nathan C. Frey, Dan Berenberg, Joseph Kleinhenz, Isidro Hotzel, Julien Lafrance-Vanasse, Ryan Lewis Kelly, Yan Wu, Arvind Rajpal, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijevic, and Saeed Saremi. Learning protein family manifolds with smoothed energy-based models. In ICLR 2023 Physics4ML Workshop, 2023. URL https://openreview.net/forum?id=Iiln B8jfo P9. Spotlight presentation. [31] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. ar Xiv preprint ar Xiv:1904.09324, 2019. 
[32] Vladimir Gligorijevic, Daniel Berenberg, Stephen Ra, Andrew Watkins, Simon Kelow, Kyunghyun Cho, and Richard Bonneau. Function-guided protein design by deep manifold sampling. bio Rxiv, 2021. [33] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268 276, 2018. [34] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. ar Xiv preprint ar Xiv:2206.09012, 2022. [35] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971 35983, 2022. [36] Carolina Gutierrez and Rachel Schiff. Her2: biology, detection, and clinical implications. Archives of pathology & laboratory medicine, 135(1):55 62, 2011. [37] Brian Hie, Salvatore Candido, Zeming Lin, Ori Kabeli, Roshan Rao, Nikita Smetanin, Tom Sercu, and Alexander Rives. A high-level programming language for generative protein design. bio Rxiv, pages 2022 12, 2022. [38] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. [39] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. [40] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454 12465, 2021. [41] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. Evaluating feature importance estimates. web preprint https://research.google/pubs/pub47088/, 2018. [42] Yuge Ji, Mohammad Lotfollahi, F Alexander Wolf, and Fabian J Theis. Machine learning for perturbational single-cell omics. Cell Systems, 12(6):522 537, 2021. [43] Simon Kelow, Bulat Faezov, Qifang Xu, Mitchell I Parker, Jared Adolf-Bryfogle, and Roland L Dunbrack Jr. A penultimate classification of canonical antibody cdr conformations. bio Rxiv, pages 2022 10, 2022. [44] H Benjamin Larman, George Jing Xu, Natalya N Pavlova, and Stephen J Elledge. Construction of a rationally designed antibody platform for sequencing-assisted selection. Proceedings of the National Academy of Sciences, 109(45):18523 18528, 2012. [45] Jin Sub Lee and Philip M Kim. Proteinsgm: Score-based generative modeling for de novo protein design. bio Rxiv, 2022. [46] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. ar Xiv preprint ar Xiv:1805.00909, 2018. [47] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. ar Xiv preprint ar Xiv:2205.14217, 2022. [48] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. ar Xiv preprint ar Xiv:2310.01889, 2023. [49] Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar. A text-guided protein design framework. ar Xiv preprint ar Xiv:2302.04611, 2023. [50] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 
Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461 11471, 2022. [51] Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigenspecific antibody design and optimization with diffusion-based generative models. bio Rxiv, 2022. [52] Pascale Mathonet and Christopher G Ullman. The application of next generation sequencing to the understanding of antibody repertoires. Frontiers in immunology, 4:265, 2013. [53] Igor Melnyk, Payel Das, Vijil Chenthamarakshan, and Aurelie Lozano. Benchmarking deep generative models for diverse antibody sequence design. ar Xiv preprint ar Xiv:2111.06801, 2021. [54] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4467 4477, 2017. [55] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162 8171. PMLR, 2021. [56] Tobias H Olsen, Fergus Boyles, and Charlotte M Deane. Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science, 31(1):141 146, 2022. [57] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. ar Xiv preprint ar Xiv:2303.06349, 2023. [58] Vishakh Padmakumar, Richard Yuanzhe Pang, He He, and Ankur P Parikh. Extrapolative controlled sequence generation via iterative refinement. ar Xiv preprint ar Xiv:2303.04562, 2023. [59] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. ar Xiv preprint ar Xiv:2302.10866, 2023. [60] Machel Reid, Vincent J Hellendoorn, and Graham Neubig. Diffuser: Discrete diffusion via edit-based reconstruction. ar Xiv preprint ar Xiv:2210.16886, 2022. [61] Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, and Jian Peng. Proximal exploration for model-guided protein sequence design. In International Conference on Machine Learning, pages 18520 18536. PMLR, 2022. [62] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684 10695, 2022. [63] Philip A Romero and Frances H Arnold. Exploring protein fitness landscapes by directed evolution. Nature reviews Molecular cell biology, 10(12):866 876, 2009. [64] Jeffrey A Ruffolo and Jeffrey J Gray. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Biophysical Journal, 121(3):155a 156a, 2022. [65] Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. Step-unrolled denoising autoencoders for text generation, 2021. [66] Harshay Shah, Prateek Jain, and Praneeth Netrapalli. Do input gradients highlight discriminative features? Advances in Neural Information Processing Systems, 34:2046 2059, 2021. 
[67] Amir Shanehsazzadeh, Sharrol Bachas, Matt Mc Partlon, George Kasun, John M Sutton, Andrea K Steiger, Richard Shuai, Christa Kohnert, Goran Rakocevic, Jahir M Gutierrez, et al. Unlocking de novo antibody design with generative artificial intelligence. bio Rxiv, pages 2023 01, 2023. [68] Richard W Shuai, Jeffrey A Ruffolo, and Jeffrey J Gray. Generative language modeling for antibody design. bio Rxiv, 2021. [69] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. ar Xiv preprint ar Xiv:1312.6034, 2013. [70] Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, and Eric D Kelsic. Adalead: A simple and robust adaptive greedy search algorithm for sequence design. ar Xiv preprint ar Xiv:2010.02141, 2020. [71] Raghav Singhal, Mark Goldstein, and Rajesh Ranganath. Where to diffuse, how to diffuse, and how to get back: Automated learning for multivariate diffusions. ar Xiv preprint ar Xiv:2302.07261, 2023. [72] Suraj Srinivas, Kyle Matoba, Himabindu Lakkaraju, and François Fleuret. Efficient training of low-curvature neural networks. Advances in Neural Information Processing Systems, 35: 25951 25964, 2022. [73] Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating bayesian optimization for biological sequence design with denoising autoencoders. ar Xiv preprint ar Xiv:2203.12742, 2022. [74] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. ar Xiv preprint ar Xiv:2211.04236, 2022. [75] Baris E Suzek, Hongzhan Huang, Peter Mc Garvey, Raja Mazumder, and Cathy H Wu. Uniref: comprehensive and non-redundant uniprot reference clusters. Bioinformatics, 23(10):1282 1288, 2007. [76] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818 2826, 2016. [77] Brian L Trippe, Jason Yim, Doug Tischer, Tamara Broderick, David Baker, Regina Barzilay, and Tommi Jaakkola. Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. ar Xiv preprint ar Xiv:2206.04119, 2022. [78] Robert Verkuil, Ori Kabeli, Yilun Du, Basile IM Wicky, Lukas F Milles, Justas Dauparas, David Baker, Sergey Ovchinnikov, Tom Sercu, and Alexander Rives. Language models generalize beyond natural proteins. bio Rxiv, 2022. [79] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. ar Xiv preprint ar Xiv:2209.14734, 2022. [80] Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bio Rxiv, pages 2022 12, 2022. [81] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697 4708, 2020. [82] James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparameterization trick for acquisition functions. 
[83] Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. arXiv preprint arXiv:2104.05218, 2021.

Table of Contents

A Extended Background 16
  A.1 Continuous noise diffusion 16
  A.2 Categorical noise diffusion 17
B Methodological Details 17
  B.1 Infilling algorithm 17
  B.2 Hidden State Langevin Sampling 18
C Infilling / NOS Guidance 19
  C.1 Infilling experiment 19
  C.2 MCMC comparison 19
  C.3 PPLM details 19
  C.4 Model Architecture and Training 20
  C.5 Hyperparameter settings 20
  C.6 Density plots 22
D LaMBO-2 23
  D.1 Intro to Multi-Objective Bayesian Optimization 23
  D.2 Discrete EHVI 24
  D.3 Architecture and Hyperparameters 24
  D.4 Training Data, Class Imbalance, and Label Smoothing 25
  D.5 Baselining LaMBO-2 Against Unguided Sequence and Structure-Based Diversification 26
  D.6 Are Saliency Maps Reliable? 27
  D.7 Wetlab Validation 28

A Extended Background

In this section we provide full descriptions of the diffusion processes introduced in Sec. 3.

A.1 Continuous noise diffusion

The forward process is defined by noise variances $\beta_t$. We use the cosine variance schedule from Nichol and Dhariwal [55]. For convenience we further define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. The forward process is defined by the conditional distributions
$$p(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big)$$
$$p(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big)$$
$$p(x_t \mid w) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, U_\theta w,\ (1 - \bar{\alpha}_t) I\big)$$
where $U_\theta$ is an embedding matrix. The reverse process is defined by
$$\pi(x) = \mathcal{N}(0, I)$$
$$p(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \mu_t,\ \sigma_t^2 I\big)$$
$$\mu_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad \sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$
$$p_\theta(w \mid x_t) = \mathrm{Softmax}\big(\phi_\theta(x_t)\big)$$
$$p_\theta(x_{t-1} \mid x_t) = \sum_{\hat{w}} p(x_{t-1} \mid x_t, x_0 = U_\theta \hat{w})\, p_\theta(\hat{w} \mid x_t).$$

Figure 8: Illustration of a string gradually corrupted by [MASK] tokens under the forward process and recovered by the reverse process, e.g. "This is a sample string" → "This is a [MASK] string" → "This [MASK] a [MASK] [MASK]" → "[MASK] [MASK] [MASK] [MASK] [MASK]".

A.2 Categorical noise diffusion

Following Austin et al. [4] we define the MLM-style categorical diffusion using transition matrices
$$[Q_t]_{ij} = \begin{cases} 1 & \text{if } i = j = m \\ \alpha_t & \text{if } j = m,\ i \neq m \\ 1 - \alpha_t & \text{if } i = j \neq m \end{cases}$$
and $\bar{Q}_t = Q_1 Q_2 \cdots Q_t$ for noise schedule $\alpha_t \in [0, 1]$, where $m$ denotes the index of the [MASK] token (see Figure 8 for an illustration). These transition matrices correspond to categorical conditional distributions
$$p(w_t \mid w_{t-1}) = \mathrm{Cat}\big(w_t;\ p = w_{t-1} Q_t\big), \qquad p(w_t \mid w_0) = \mathrm{Cat}\big(w_t;\ p = w_0 \bar{Q}_t\big).$$
The reverse process is defined by
$$\pi(w) = \mathbf{1}\big[w = \text{[MASK]}^L\big]$$
$$p(w_{t-1} \mid w_t, w_0) = \mathrm{Cat}\left(w_{t-1};\ p = \frac{w_t Q_t^\top \odot w_0 \bar{Q}_{t-1}}{w_0 \bar{Q}_t w_t^\top}\right)$$
$$p_\theta(w_0 \mid w_t) = \mathrm{Softmax}\big(\phi_\theta(w_t)\big)$$
$$p_\theta(w_{t-1} \mid w_t) = \sum_{\hat{w}_0} p(w_{t-1} \mid w_t, \hat{w}_0)\, p_\theta(\hat{w}_0 \mid w_t).$$
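To make the categorical corruption concrete, the following is a minimal sketch (not the authors' released code) of sampling $w_t \sim p(w_t \mid w_0)$ under the absorbing [MASK] process; the token index `MASK_ID` and the schedule array `alphas` are placeholder names. Because a non-mask token survives each step with probability $1 - \alpha_s$, the marginal corruption reduces to one Bernoulli mask decision per position.

```python
import numpy as np

# Minimal sketch of the absorbing-state ("[MASK]") corruption p(w_t | w_0)
# defined by the transition matrices above. MASK_ID and alphas are
# illustrative placeholders, not taken from the paper's code.

MASK_ID = 0                      # hypothetical index of the [MASK] token
rng = np.random.default_rng(0)

def corrupt(w0: np.ndarray, t: int, alphas: np.ndarray) -> np.ndarray:
    """Sample w_t ~ p(w_t | w_0) for the mask-absorbing process.

    Under Q_1 ... Q_t, a non-mask token is still unmasked at step t with
    probability prod_{s<=t} (1 - alpha_s), so the marginal corruption is a
    single Bernoulli mask per position.
    """
    survive = np.prod(1.0 - alphas[:t])          # P(token still unmasked at step t)
    keep = rng.random(w0.shape) < survive        # which positions survive
    return np.where(keep, w0, MASK_ID)

# toy usage: a length-8 "sequence" with vocabulary {1, ..., 20}
w0 = rng.integers(1, 21, size=8)
alphas = np.full(16, 0.15)                       # placeholder per-step mask rates
print(w0, corrupt(w0, t=8, alphas=alphas))
```

The Gaussian corruption of Subsec. A.1 is analogous, with the embedded sequence $U_\theta w_0$ rescaled by $\sqrt{\bar{\alpha}_t}$ and perturbed with variance $1 - \bar{\alpha}_t$.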
B Methodological Details

B.1 Infilling algorithm

We sample infills using the procedure in Algorithm 1. The infill mask P is constructed by setting the index of each conserved residue equal to 1, in this case at every residue that is not included in the set of CDR regions being infilled. We use the same algorithm to perform the guided infilling in Subsec. 5.2, where it is extended with a guidance Langevin sampling step.

Algorithm 1 Infilling with a categorical denoising diffusion model
Inputs: Denoiser $p_\theta(\hat{w} \mid x_t, t)$, corruption process $p(x_t \mid x_0)$, infilling mask $P$, and seed sequence $s$
Returns: Sample from $p(w) = p_\theta(w \mid P, s)$
  $x_T \sim p(x_T)$, $s_T \sim p(s_T \mid s)$
  $x_T \leftarrow (I - P^\top P)\, x_T + P^\top s_T$
  for $t = T, \ldots, 1$ do
    $p(x_{t-1} \mid x_t) \leftarrow \sum_{\hat{w}} p(x_{t-1} \mid x_t, \hat{w})\, p_\theta(\hat{w} \mid x_t, t)$
    $x_{t-1} \sim p(x_{t-1} \mid x_t)$, $s_{t-1} \sim p(s_{t-1} \mid s)$
    $x_{t-1} \leftarrow (I - P^\top P)\, x_{t-1} + P^\top s_{t-1}$
  end for
  $w \sim p_\theta(w \mid x_0)$
  return $w$

B.2 Hidden State Langevin Sampling

Design of molecules or images with generative models is often posed as the problem of sampling from a posterior distribution $p(x \mid a)$ given the unconditional distribution $p(x)$ and attribute model $p(a \mid x)$. Indeed, reinforcement learning, the design of good actions in an environment, can also be framed as posterior sampling where $p(a \mid x)$ is the probability that a given state or state-action pair is optimal [46]. Methods that employ posterior sampling of this form are often called plug-and-play because $p(a \mid x)$ and $p(x)$ need not share parameters, and therefore users can mix and match different instantiations [54, 17, 34, 26].

The most common way to sample from the posterior $p(x \mid a) \propto p(a \mid x)\, p(x)$ is through Langevin sampling on the unnormalized joint density $p(a, x) = p(a \mid x)\, p(x)$, with sampling steps
$$x_{i+1} = x_i + \eta\, \nabla_x \log p(a, x_i) + \sqrt{2\eta}\, z_i = x_i + \eta \big(\nabla_x \log p(a \mid x_i) + \nabla_x \log p(x_i)\big) + \sqrt{2\eta}\, z_i, \qquad z_i \sim \mathcal{N}(0, I).$$
When we work with generative models over continuous random variables that permit a likelihood (e.g. normalizing flows), score function (e.g. diffusions), or energy (e.g. EBMs), $\nabla_x \log p(x)$ has a natural interpretation and sampling can be performed with essentially vanilla Langevin sampling. In other cases, where only a denoising function over continuous variables is available, authors have proposed approximate samplers using an approximation of the score function [54].

When we instead hope to sample from a posterior over discrete random variables, constructing an analogue of the score function $\nabla \log p(x)$ is challenging, and prior work adopts a different approach of regularizing the conditional sampling distribution $p(w \mid a)$ with unconditional sampling $p(w)$ in order to maintain high likelihood [17]. In autoregressive models, $p(w)$ is broken down using the chain rule into conditionals $p(w_t \mid w_{<t})$, and thus the appropriate regularization is
$$\mathrm{KL}\big(p(w_t \mid w_{<t})\ \|\ p(w_t \mid w_{<t}, a)\big). \qquad (6)$$
In our case, the distribution $p(w)$ is factorized by the transition distributions $p(w_{t-1} \mid w_t)$ (or their continuous analogues in token embedding space), and we hope to sample from the perturbed transition
$$p(w_{t-1} \mid w_t) \propto p_\theta(w_{t-1} \mid w_t)\, \exp\big(v_\theta(w_{t-1})\big).$$
The correct regularization term in our case is thus
$$\mathrm{KL}\big(p(w_{t-1} \mid w_t)\ \|\ p(w_{t-1} \mid w_t, a)\big).$$
To put the pieces together, we first recognize that the denoising model $p_\theta(w_0 \mid w_t)$ can be broken down into a language model head, $H_\theta$, and trunk, $T_\theta$, with
$$h_t = T_\theta(w_t), \qquad p_\theta(w_0 \mid w_t) = H_\theta(w_0 \mid h_t).$$
We can then perform Langevin sampling on the hidden representations, initializing with $h_t$, as shown in Algorithm 2. In the experiments above we set $\lambda_3 = 0$, as we saw no noticeable benefit from adding additional stochasticity. Importantly, sampling from $p(w_{t-1} \mid w_t)$ already introduces randomness into the reverse process.
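Before stating the full procedure in Algorithm 2 below, here is a minimal PyTorch sketch of the inner hidden-state update, with the two gradient weights folded into a single step size and KL weight; `trunk`, `head`, and `value_fn` are placeholder modules, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one hidden-state Langevin update (cf. Algorithm 2 below).
# `head` maps hidden states to token logits and `value_fn` maps hidden states
# to a scalar objective; both are placeholder callables.

def guided_hidden_step(h, p_uncond, head, value_fn, lam=0.1, eta=0.5, n_steps=5):
    """Ascend the value function while penalizing KL(p_uncond || p_hat),
    where p_hat is decoded from the current (perturbed) hidden states."""
    h = h.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        logits = head(h)                              # (L, V) token logits
        log_p_hat = F.log_softmax(logits, dim=-1)
        # KL(p_uncond || p_hat), summed over positions
        kl = torch.sum(p_uncond * (torch.log(p_uncond + 1e-8) - log_p_hat))
        obj = value_fn(h).sum() - lam * kl            # maximize value, stay close
        grad, = torch.autograd.grad(obj, h)
        with torch.no_grad():
            h = h + eta * grad                        # noise term omitted (lambda_3 = 0)
        h.requires_grad_(True)
    return h.detach()
```

A full sampler wraps this update inside the reverse diffusion and decodes $w_{t-1} \sim H_\theta(\cdot \mid h^K)$ after the inner loop, as in Algorithm 2.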
Algorithm 2 Guided diffusion sampler
Inputs: Denoiser $p_\theta(\hat{w} \mid x_t, t) = [T_\theta, H_\theta]$, value function $v_\theta$, and weights $\lambda_1, \lambda_2, \lambda_3$
Returns: Sample from $p'(w) \propto p(w)\exp(f(w))$
  $w_T = \text{[MASK]}^L$
  for $t = T, \ldots, 1$ do
    $p(w_{t-1} \mid w_t) \leftarrow \sum_{\hat{w}} p(w_{t-1} \mid w_t, \hat{w})\, p_\theta(\hat{w} \mid w_t)$
    $h^0 \leftarrow T_\theta(w_t)$
    for $i = 0, \ldots, K - 1$ do
      $z_i \sim \mathcal{N}(0, I)$
      $\hat{p} \leftarrow \sum_{\hat{w}} p(w_{t-1} \mid w_t, \hat{w})\, H_\theta(\hat{w} \mid h^i)$
      $h^{i+1} \leftarrow h^i + \lambda_1 \nabla_h v_\theta(h^i) - \lambda_2 \nabla_h \mathrm{KL}\big(p(w_{t-1} \mid w_t)\ \|\ \hat{p}\big) + \lambda_3 z_i$
    end for
    $w_{t-1} \sim H_\theta(h^K)$
  end for
  return $w_0$

C Infilling / NOS Guidance

All of our diffusion models are trained on all paired heavy and light chain sequences from OAS [56] (pOAS) combined with all sequences from SAbDab [25], aligned with ANARCI [24].

C.1 Infilling experiment

For our trained diffusion models, we use Algorithm 1 without guidance, generating P based on the indicated CDRs, using Chothia numbering for consistency with DiffAb. For the baselines, we constructed wrapper scripts to convert the chosen CDR ids into each method's native format.

C.2 MCMC comparison

Following Verkuil et al. [78], we construct a Markov chain using uniform random mutations to map a sequence $w$ to a mutated sequence $w'$, using the following Metropolis-Hastings correction:
$$p(\text{accept } w' \mid w) = \min\left(1,\ \frac{\exp(-E(w')/T)}{\exp(-E(w)/T)}\right)$$
where $T > 0$ is a temperature hyperparameter. While this method has appealing theoretical properties, obtaining good samples from this Markov chain in practice requires hundreds of thousands of steps of burn-in. In our experiment (Figure 9), we define the energy, $E$, by combining sequence-level probabilities assigned by IgLM with a beta sheets objective function trained on IgLM's representations. We construct the energy as
$$E(w) = -\log p_{\text{IgLM}}(w) - \lambda\, v_\theta(w),$$
and tune $\lambda$ to generate sequences with approximately 40% beta sheets. We also tune the NOS $\lambda$ parameter (Eq. 4) to produce approximately 40% beta sheets.

Figure 9: (left) Comparing convergence in sampling using Metropolis-Hastings-adjusted MCMC [78] against NOS models (x-axis: sampling steps). Diffusion models (ours) accelerate sampling by two orders of magnitude while converging to similar energy values.

C.3 PPLM details

In order to generate full (heavy and light chain) optimized antibodies with PPLM and IgLM, we train two separate value function models on IgLM's aggregated hidden representations, one for heavy chain sequences and one for light chain sequences. IgLM uses special tokens for both the chain identity and the species identity of each sequence, and we pass in the appropriate corresponding tokens when calculating the hidden representations for each model. To determine the correct species token for each sequence, we use the predicted species returned by ANARCI [24]. Our value function is a simple one-layer feed-forward neural network trained on top of the mean-aggregated representations for the corresponding chain identity. To sample using PPLM, we overwrite the forward pass of the HuggingFace decoder used by IgLM to include a Langevin sampling step over the current hidden representations. We perform $K$ gradient steps to update the current hidden representation $h'$ by descending on the objective
$$\lambda\, \mathrm{KL}\big[p(\hat{w} \mid h')\ \|\ p(\hat{w} \mid h)\big] - v(h'),$$
where $h$ is the original hidden representation output by the model's encoder, and $\eta$ and $\lambda$ are the step size and regularization strength respectively. We ran optimization with both vanilla gradient descent and AdaGrad [23] and found AdaGrad to be more robust to poor specifications of the step size.
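The latent update just described can be sketched as follows; `decoder_logits` and `value_fn` are stand-in callables rather than IgLM's actual interface, and only the objective and the use of AdaGrad are taken from the description above.

```python
import torch
import torch.nn.functional as F

# Sketch of the PPLM-style update described above: K gradient steps on the
# current hidden representation h', descending lambda * KL[p(w|h') || p(w|h)] - v(h').
# `decoder_logits(h)` (hidden state -> token logits) and `value_fn(h)`
# (hidden state -> scalar) are placeholders, not IgLM's real interface.

def pplm_update(h, decoder_logits, value_fn, lam=0.01, eta=1.1, K=10):
    h_orig = h.detach()
    log_p_ref = F.log_softmax(decoder_logits(h_orig), dim=-1).detach()
    h_prime = h.detach().clone().requires_grad_(True)
    opt = torch.optim.Adagrad([h_prime], lr=eta)
    for _ in range(K):
        opt.zero_grad()
        log_p = F.log_softmax(decoder_logits(h_prime), dim=-1)
        # KL[p(w | h') || p(w | h)], with the perturbed distribution first
        kl = torch.sum(log_p.exp() * (log_p - log_p_ref))
        loss = lam * kl - value_fn(h_prime).sum()
        loss.backward()
        opt.step()
    return h_prime.detach()
```

Unlike NOS, positions already emitted by the autoregressive decoder are never revisited.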
For the results in Sec. 5, we draw samples and present results for all of the hyperparameter settings in Table 1.

Table 1: Hyperparameter settings used for PPLM. λ controls the strength of the regularization; large values prevent sampling values that differ significantly from the unguided model. η controls the size of steps taken in the latent space; larger step sizes, when not too large, can increase the distance traveled in the latent space and the extent to which sampling can yield samples with high values of the objective.
  λ: 0, 0.001, 0.01, 0.1, 1.0
  η: 0.5, 0.8, 1.1, 1.4, 1.7, 2.0
  K: 5, 10
  optimizer: SGD, AdaGrad

One critical difference between controllable autoregressive models and controllable diffusions is the ability to resample previously sampled values. Procedures that allow for resampling are often called iterative refinement procedures because they can produce increasingly plausible generations by refining the model's previous output at each step of an iterative procedure. Because there are many potential differences between our NOS models and PPLM, including but not limited to the nature of iterative refinement, we performed an additional experiment to assess the impact of adapting a discrete diffusion to perform autoregressive sampling. Autoregressive models can themselves be thought of as diffusions with an idiosyncratic corruption process that masks out all tokens to the right of the last sampled token. As in our discrete corruption process, the prior is also a sequence of all mask tokens. Using this insight, we can run our trained discrete diffusions in autoregressive mode by contriving the sampling noise schedule to be autoregressive and recovering an approximation of the timestep post hoc from the percentage of masks at each step of autoregressive sampling.

Figure 10 shows the difference in objective values and likelihood for samples obtained by running the model in typical diffusion mode (iterative refinement) or in contrived autoregressive mode. We can see that on the beta sheets objective, iterative refinement has a noticeable positive impact on the objective values of the samples. This effect is also present in the SASA objective, but to a much more limited extent. We speculate that the iterative refinement facet of NOS is helpful for outperforming other methods but not completely sufficient.

C.4 Model Architecture and Training

The Gaussian and categorical diffusions are trained with the BERT-small transformer backbone introduced by Bhargava et al. [8]. We use a cosine noise schedule for both diffusions and train for 100 epochs with a batch size of 64, optimizing with AdamW using an initial learning rate of 5e-3 with a linear warmup. The value function is a feed-forward neural network with one hidden layer. The value function is trained jointly with the denoiser by alternating optimization steps, with 5 steps on the generative objective for each step on the discriminative objective. We train the models for 100 epochs in total.

C.5 Hyperparameter settings

For each guided sampling experiment with NOS, we sample using many different hyperparameter combinations in order to generate both conservative and aggressive optimization of the value function.

Figure 10: We compare samples from running our guided discrete diffusion (NOS-D) with diffusion-style sampling versus autoregressive-style sampling (x-axis: ProtGPT log-likelihood; y-axes: % beta sheets and SASA).
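The contrived autoregressive mode used for Figure 10 can be sketched as follows; `sample_token` stands in for a call into the trained denoiser and is not part of the released code.

```python
import numpy as np

# Sketch of running a mask-based discrete diffusion in "autoregressive mode":
# unmask positions strictly left to right and recover an approximate diffusion
# timestep from the fraction of remaining masks.

MASK_ID = 0
T = 16          # number of diffusion steps the model was trained with

def autoregressive_mode(seq_len, sample_token):
    w = np.full(seq_len, MASK_ID)               # prior: all [MASK] tokens
    for pos in range(seq_len):
        frac_masked = np.mean(w == MASK_ID)     # fraction of masks remaining
        t = int(round(frac_masked * T))         # post-hoc timestep estimate
        w[pos] = sample_token(w.copy(), pos, t) # denoiser fills one position
    return w
```

Because nothing to the left of the current position is ever resampled, this mode removes exactly the iterative-refinement behavior examined in Figure 10.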
We find that using an iterative refinement procedure does lead to consistent improvements in the objective value, though not to an extent that would suggest iterative refinement alone is sufficient for strong sampling performance. The full hyperparameter settings for both objectives (beta sheets and SASA) and both corruption types (NOS-D and NOS-C) are shown in Table 2.

In Table 2 there is an additional hyperparameter, the guidance layer, which we did not discuss at length in the main text of the paper. This parameter dictates whether we perform guidance in the first layer of the neural network (the token embeddings), as is standard in continuous diffusion models for discrete sequences, or in the final layer of the neural network (the layer before the final linear head). In either case we can use the same gradient descent objective and corruption process and need only change the variable to which we propagate gradient updates. Table 2 shows the hyperparameters used in Figure 5. To aid intuition for the effects of each hyperparameter, we show the sample densities that result from each combination of λ and η in Table 2 when guiding in the first (Figure 11) and last (Figure 12) layer of the NOS-D and NOS-C models. We see that the most important parameter is λ, which controls how far samples tend to move from the seeds. We can also observe that guiding in the first hidden state tends to perform better when sampling with NOS-C, while guiding in the final hidden state tends to perform better with NOS-D.

DiGress comparison

DiGress [79] is built on top of a model with one-hot encodings and discrete corruptions. The guided sampling procedure can be described as follows (using the notation from our submission): at each denoising step t, we treat the one-hot encodings as a continuous variable and construct a perturbation distribution from a learned discriminative model,
$$\hat{v} = v_\theta(w_t), \qquad p_\theta(\hat{v} \mid w_{t-1}) \propto \exp\big(\lambda \langle \nabla_{w_t} v_\theta(w_t),\ w_{t-1} \rangle\big).$$
We then sample the next value from the base diffusion transition $p_\theta(w_{t-1} \mid w_t)$ perturbed with $p_\theta(\hat{v} \mid w_{t-1})$,
$$w_{t-1} \sim p_\theta(w_{t-1} \mid w_t)\, p_\theta(\hat{v} \mid w_{t-1}).$$
The key details for guided sampling can be found in the DiGress code repo, where we see that the guided distribution is the normalized product of the original denoising distribution and the softmax of the gradients scaled by λ. On a theoretical level, this guidance has noticeably different properties from NOS. For large λ, the perturbation $p_\theta(\hat{v} \mid w_{t-1})$ collapses to a one-hot on the token index with the largest gradient value. For small values of λ, $p_\theta(\hat{v} \mid w_{t-1})$ becomes a uniform distribution. Therefore λ interpolates $p(w_{t-1} \mid w_t)$ between the original unguided distribution and a one-hot in the max-gradient direction. NOS also reduces to unguided infilling when λ = 0, but λ > 0 only modulates the direction of the gradient update; the distance between the guided and unguided distribution is controlled by the number of Langevin steps and the step-size hyperparameter η. DiGress amounts to a single update step applied directly to the output token probabilities using a continuous relaxation of the one-hot encoded input, whereas NOS performs a sequence of local updates to hidden states that are actually continuous.
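A minimal sketch of that DiGress-style perturbation, assuming the unguided transition probabilities and the gradient of the value function with respect to the one-hot input are already available as tensors (placeholder names, not the DiGress repository's API):

```python
import torch
import torch.nn.functional as F

# Sketch of the DiGress-style guided transition described above: the guided
# distribution is the renormalized product of the unguided transition
# p_theta(w_{t-1} | w_t) and softmax(lambda * grad), where grad is the
# gradient of the value function w.r.t. the one-hot input.

def digress_guided_transition(p_uncond, grad_wrt_onehot, lam):
    """p_uncond, grad_wrt_onehot: (L, V) tensors; returns guided (L, V) probs."""
    perturbation = F.softmax(lam * grad_wrt_onehot, dim=-1)
    guided = p_uncond * perturbation
    return guided / guided.sum(dim=-1, keepdim=True)

# As lam -> infinity the perturbation collapses to a one-hot on the largest
# gradient entry; as lam -> 0 it becomes uniform and the transition is unguided.
p_uncond = torch.full((4, 21), 1.0 / 21)           # toy uniform transition
grad = torch.randn(4, 21)                          # toy value-function gradient
print(digress_guided_transition(p_uncond, grad, lam=10.0).sum(dim=-1))  # rows sum to 1
```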
In our comparison, the embeddings and corruptions of each model are chosen to be:
1. NOS-C [Gaussian corruptions + learned embeddings]
2. NOS-D [Discrete (mask) corruptions + learned embeddings]
3. DiGress [Discrete (mask) corruptions + fixed one-hot encodings]
All models use the same backbone transformer and regression heads, facilitating an apples-to-apples comparison. For DiGress, we perform sampling for a large range of scaling values λ ∈ {1e5, 3e4, 1e4, 3e3, 1e3, 3e2, 1e2, 3e1, 1e1, 1e0, 1e-1, 1e-2, 1e-3}. For each model, λ modulates the degree to which the model prefers greedy sampling from the value function gradient.

Figure 11: Density plots for every combination of the regularization (λ) and step-size (η) parameters when performing guidance in the first layer (token embeddings) of the neural network denoiser (x-axis: ProtGPT log-likelihood). We observe that λ has the strongest effect on trading off fitness under the objective against likelihood or closeness to the seed sequences.

Table 2: NOS guided sampling hyperparameter settings. λ controls the regularization strength, constraining the plausibility of samples; η, when chosen effectively, can affect the degree of optimization that takes place on the hidden states. The guidance layer is the layer in the neural network over which guidance is applied, the first being the token embeddings and the last being the final representations before the linear head. The same values are used for both NOS-D and NOS-C.
  λ: 0.001, 0.01, 0.1, 1.0, 10.0
  η: 0.1, 0.5, 1.0
  K: 5, 10
  guidance layer: first, last
  optimizer: SGD, AdaGrad

Table 3: Hyperparameter settings used in Sec. 5. The guidance layer for NOS-D is last, and the guidance layer for NOS-C is first.
  λ: 0, 0.001, 0.01, 0.1, 1.0, 10.0
  η: 1.0
  K: 10
  optimizer: AdaGrad

C.6 Density plots

Because Pareto fronts present only a partial view of sampling outcomes (focusing on the best-case outcomes along each axis), we also include sample density plots to confirm that our methods consistently yield samples with a better trade-off between likelihood and fitness. Figure 13 shows density plots for NOS and baselines when optimizing each of the two objectives (percentage of beta sheets and SASA). We find that DiffAb and IgLM samples tend to cluster around the starting seeds, while RFDiffusion tends to generate more diverse samples under the objective, but often with much lower likelihood than the seed sequences.

Figure 12: Density plots for every combination of the regularization (λ) and step-size (η) parameters when performing guidance in the last layer (pre-logits layer) of the neural network denoiser. NOS-C and NOS-D exhibit quite different performance as a function of guiding the first or final hidden representation.

Figure 13: We compare sample densities for the methods presented in Sec. 5 (RFDiffusion, DiffAb, PPLM, NOS-D, NOS-C, and the seeds), to address the limitations of showing only Pareto fronts. We see that NOS-C and NOS-D can both consistently generate samples with favorable trade-offs, while other methods tend either to radically decrease likelihood with little benefit to the value function or to remain relatively limited to the neighborhood around the seed sequences.
By contrast, both NOS methods consistently improve values of the objective without sacrificing likelihood.

D LaMBO-2

D.1 Intro to Multi-Objective Bayesian Optimization

When there are multiple objectives of interest, a single best (i.e. strictly dominant) sequence x may not exist. Suppose there are k objectives, $f: \mathcal{X} \to \mathbb{R}^k$. The goal of multi-objective optimization (MOO) is to identify the set of Pareto-optimal (i.e. non-dominated) solutions, such that improving one objective within the set leads to worsening another. We say that x dominates x′, written $f(x) \succ f(x')$, if $f_j(x) \ge f_j(x')$ for all $j \in \{1, \ldots, k\}$ and $f_j(x) > f_j(x')$ for some j. The set of non-dominated solutions $\mathcal{X}^*$ is defined in terms of the Pareto frontier (PF) $\mathcal{P}^*$,
$$\mathcal{X}^* = \{x : f(x) \in \mathcal{P}^*\}, \quad \text{where } \mathcal{P}^* = \{f(x) : x \in \mathcal{X},\ \nexists\, x' \in \mathcal{X} \text{ s.t. } f(x') \succ f(x)\}. \qquad (7)$$
MOO algorithms typically aim to identify a finite approximation to $\mathcal{X}^*$ (which may be infinitely large) within a reasonable number of iterations. One way to measure the quality of an approximate PF $\mathcal{P}$ is to compute the hypervolume $\mathrm{HV}(\mathcal{P} \mid r_{\text{ref}})$ of the polytope bounded by $\mathcal{P} \cup \{r_{\text{ref}}\}$, where $r_{\text{ref}} \in \mathbb{R}^k$ is a user-specified reference point. The corresponding acquisition utility is the hypervolume improvement,
$$u_{\text{EHVI}}(x, f, \mathcal{D}) = \mathrm{HVI}(\mathcal{P}', \mathcal{P} \mid r_{\text{ref}}) = \mathrm{HV}(\mathcal{P}' \mid r_{\text{ref}}) - \mathrm{HV}(\mathcal{P} \mid r_{\text{ref}}), \qquad (8)$$
where $\mathcal{P}' = \mathcal{P} \cup \{\hat{f}(x)\}$ [27, 28, 18]. To decide where to query f next, we search for $\operatorname{argmax}_x \mathbb{E}[u_{\text{EHVI}}(x, f, \mathcal{D})]$, where the expectation is taken w.r.t. $p(f \mid \mathcal{D})$.

Algorithm 3 LaMBO-2: one guided discrete diffusion step
Inputs: Seed sequence $w_0$, edit-budget projection $P$, diffusion timestep $t$, corruption function $c(w, t)$, constraint function $u(w)$, encoder $g_\theta(w)$, value function $v_\theta(h)$, decoder $d_\theta(h)$, regularization strength $\lambda$, SGLD step size $\eta$, and temperature $\tau$.
Returns: Best feasible sample from an SGLD chain with distribution $p'(x) \propto p(x)\exp(f \circ g(x))$
  $w^*, v^* \leftarrow w_0,\ v_\theta \circ g_\theta(w_0)$ (initialize optimal solution)
  $w'_0 \leftarrow c(w_0, t)$ (apply diffusion noise)
  $h'_0 \leftarrow g_\theta(w'_0)$ (initialize hidden state)
  for $i = 1, \ldots, I$ do
    $\mathrm{loss} \leftarrow \lambda\, \mathrm{KL}\big[d_\theta(h'_{i-1})\ \|\ d_\theta(h'_0)\big] - (1 - \lambda)\, v_\theta(h'_{i-1})$
    $h'_i \leftarrow h'_{i-1} - P\big(\eta\, \nabla_{h'} \mathrm{loss} + \sqrt{2\eta\tau}\, \varepsilon\big),\quad \varepsilon \sim \mathcal{N}(0, I)$ (projected SGLD step)
    $w_i \leftarrow d_\theta(h'_i)$ (decode hidden state)
    if $v^* < v_\theta \circ g_\theta(w_i)$ and $u(w_i)$ then
      $w^* \leftarrow w_i$, $v^* \leftarrow v_\theta \circ g_\theta(w_i)$
    end if
  end for
  return $w^*, v^*$

D.2 Discrete EHVI

Although expression yield and binding affinity are both continuous measurements, we chose to discretize them and model them with a classification (softmax) likelihood (see Appendix D.4). As a result we needed an extension of EHVI for discrete outcomes. Informally, EHVI simply computes the HVI for different realizations of f and marginalizes f using $p(f \mid \mathcal{D})$. Instead of taking f to be the latent function of some regression $y = f(w) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we instead take f to be the logits of a categorical distribution, $p(y = i \mid w, \mathcal{D}) = \int \mathrm{softmax}_i(f(w))\, p(f \mid \mathcal{D})\, df$. Let $y = [y_1 \cdots y_k]^\top$. Given a set of baseline points $B \subset \mathcal{A}^L$ we define $\mathcal{P}$ (Eq. 8) using the posterior mean $\hat{y}(w) = \mathbb{E}[y \mid w, \mathcal{D}]$, $w \in B$. We model $y_1, \ldots, y_k$ as conditionally independent given some shared hidden state $h = g_\theta(w)$, so $p(y \mid h, \mathcal{D})$ factorizes nicely. Finally we define $\mathcal{P}' = \mathcal{P} \cup \{y\}$ and take the expectation of Eq. 8 w.r.t. $p(y \mid h, \mathcal{D})$. Since $p(y \mid h, \mathcal{D})$ is discrete and factorizes, we can marginalize in closed form when $K_1 \times \cdots \times K_k$ is not too large, where $K_i$ is the number of classes corresponding to the discretization of the original continuous $f_i$.
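For two objectives this closed-form marginalization is small enough to write out directly; the following is a minimal sketch in which class indices stand in for objective values (one class per decile, matching Appendix D.4), with a simple exact 2D hypervolume routine. None of the names come from the LaMBO-2 code.

```python
import numpy as np
from itertools import product

# Sketch of the discrete EHVI described above for k = 2 objectives, where each
# objective is modeled with a categorical likelihood over quantized classes.

def hv_2d(points, ref):
    """Exact hypervolume dominated by `points` above `ref` for 2 objectives."""
    pts = np.array([p for p in points if (p > ref).all()])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]          # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def discrete_ehvi(class_probs, baseline, ref):
    """E[HV(P u {y}) - HV(P)] under a factorized categorical p(y | h, D).

    class_probs: list of length-k arrays, class_probs[j][c] = p(y_j = c).
    baseline: (n, k) array of posterior-mean objective values for points in B.
    """
    hv_base = hv_2d(baseline, ref)
    ehvi = 0.0
    for classes in product(*[range(len(p)) for p in class_probs]):
        prob = np.prod([class_probs[j][c] for j, c in enumerate(classes)])
        y = np.array(classes, dtype=float)
        ehvi += prob * (hv_2d(np.vstack([baseline, y]), ref) - hv_base)
    return ehvi

baseline = np.array([[4.0, 7.0], [6.0, 5.0]])           # toy Pareto baseline
probs = [np.full(11, 1 / 11), np.full(11, 1 / 11)]      # 11 classes per objective
print(discrete_ehvi(probs, baseline, ref=np.zeros(2)))
```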
D.3 Architecture and Hyperparameters

The inputs to the LaMBO-2 model for antibody design are the variable heavy (VH) and variable light (VL) regions of the antibody sequence, as determined by AHo alignment with ANARCI, as well as the (unaligned) antigen sequence. Note that the concatenation of the antigen to the input makes the samples from the generative head conditional on the antigen as well as the unmasked portion of the antibody sequence. The LaMBO-2 model jointly predicts antigen-conditional categorical token distributions for corrupted positions and discriminative distributions over protein properties. Discriminative predictions that should not depend on the antigen are made invariant through data augmentation with random antigen sequences. See Algorithm 3 for an overview of a single guided diffusion step with LaMBO-2.

Model Architecture: our architecture for this experiment is inspired by the one proposed by Stanton et al. [73]. In particular, we jointly train an encoder shared between a generative discrete diffusion head and discriminative heads which predict expression and affinity. Rather than use a deep kernel GP, we simply ensemble 10 heads for each discriminative task to obtain uncertainty estimates. Like Stanton et al. [73], for this experiment we use 1D CNN residual blocks (kernel width 9) with layer normalization and sinusoidal position embeddings. The shared encoder was composed of 4 residual blocks, and each task head was composed of 2 residual blocks followed by a linear layer, with the exception of the generative head, which was just a linear layer on top of the shared embeddings. Note that in future work self-attention layers could be used instead of CNN layers, as was the case for the pOAS experiments in Sec. 5. We set the embedding dimension to 32 and the latent channel dimension to 256.

Training Hyperparameters: the LaMBO-2 model is both a jointly trained generative and discriminative model, as well as a true multi-task model, which is necessary since measurements for various protein properties are often missing from a substantial fraction of rows in real-world datasets. We trained for 500K gradient updates using the Adam optimizer with η = 1e-3 and β = (0.99, 0.999). At each gradient step we randomly sampled a task head and task minibatch (batch size 121) and updated the corresponding weights (including shared weights). We used a linear learning rate warmup over 10K gradient updates and decayed the learning rate to 1e-6 with a cosine schedule. We did not regularize with weight decay or dropout.

Generation Hyperparameters: to generate the designs in Figure 7, we sampled 1K designs from a pool of seed antibody sequences hand-selected by domain experts. For each seed we set the total edit budget shared between chains to B = 16. In this experiment each infilling method took 16 diffusion steps, using an inverse linear noise schedule α_t = 1/(1 + t). Although the models were trained with a standard cosine noise schedule, we found the inverse linear schedule gave better results in terms of sample acquisition value at generation time. Within each diffusion step we took 64 Langevin steps with noise scale τ = 1e-2. For guided infills with uniformly distributed edit positions we set τ = 1e6. For guided infills with saliency-informed edit position selection we set τ = 0.1. We set λ = 0.5 to balance the tradeoff between sequence likelihood and value during guidance.

Generation Constraints: in addition to the edit-budget locality constraint, our LaMBO-2 designs were also constrained to meet certain sequence liability constraints:

Canonical Cysteine Conservation: there are specific conserved cysteine residues in antibody sequences which play a crucial role in the formation of disulfide bridges.
Disulfide bridges are covalent bonds formed between two cysteine residues through oxidation of their sulfur atoms. These bridges contribute to the overall structural stability and integrity of antibodies.

No Unpaired Cysteines: odd numbers of cysteines within individual chains (i.e. unpaired cysteines) are generally undesirable since they can lead to non-native disulfide bonds between different antibody molecules, which may disrupt assembly, folding, or function.

No Glycosylation Motifs: a glycosylation motif is a specific amino acid sequence within a protein that serves as a recognition site for the attachment of sugar molecules. The presence of a glycosylation motif in a protein can affect its stability, solubility, activity, and function. The addition of sugar molecules can alter the protein's conformation, change its interactions with other proteins or molecules, and affect its trafficking and localization within the cell.

D.4 Training Data, Class Imbalance, and Label Smoothing

Training Data: the expression task heads were trained on a dataset of 10K linear transfection expression measurements, which was subsequently augmented to 160K rows by pairing the same measurements with different random antigens to teach the model to ignore the antigen sequence when predicting expression. The binding task heads were trained on a dataset of 10K SPR affinity measurements for various antigens, which was then augmented to 12K rows by pairing binders with different random antigens and imputing a non-binding label. This augmentation is important for training a pan-target affinity model, since experimental measurements of affinity to off-target antigens are uncommon. Note that the expression and affinity data only partially overlapped, necessitating the multi-task architecture described in Appendix D.3. The generative diffusion head was trained only on binding antibody-antigen pairs in the SPR binding data. We did not pretrain our LaMBO-2 models. It is likely that performance could be improved with the right pretraining corpus; however, it is unclear whether datasets like pOAS are particularly useful for pretraining antibody design models, since most do not report antigen sequences and may not have the right level of variability. In any case, it is very encouraging to see positive real-world results before scaling in earnest.

Figure 14: An illustration of using quantization to address heavily imbalanced data (x-axes: measured pKD and quantized pKD). On the right we show the original marginal label distribution in green, and the discretization boundaries as dotted lines. The boundaries are defined by a minimal level of affinity to be considered a binder (pKD = 4) and pKD deciles computed from the remaining measurements.

Label Discretization: as noted above, biological data tends to be very imbalanced, and historical experimental data even more so, since there are strong selection effects imposed by the scientists collecting the data. We chose to discretize continuous properties like expression yield and binding affinity, making it easier to correct for class imbalance by upsampling minority classes. In Figure 14 we illustrate our discretization scheme. Any antibody-antigen pair with pKD (i.e. −log10(KD)) less than 4 was assigned to the non-binding class 0. Then binders were assigned to classes 1-10 based on which pKD decile (computed from binders only) they resided in.
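A minimal sketch of this discretization, assuming raw pKD values in a NumPy array and the thresholds stated above (the function name is illustrative):

```python
import numpy as np

# Sketch of the label discretization above: pKD < 4 -> non-binding class 0,
# otherwise classes 1-10 by decile of the binder-only pKD distribution.

def quantize_pkd(pkd: np.ndarray, binder_threshold: float = 4.0) -> np.ndarray:
    labels = np.zeros(len(pkd), dtype=int)
    binders = pkd >= binder_threshold
    # decile edges computed from binders only
    edges = np.quantile(pkd[binders], np.linspace(0.1, 0.9, 9))
    labels[binders] = 1 + np.searchsorted(edges, pkd[binders], side="right")
    return labels

pkd = np.array([2.5, 4.1, 6.0, 7.3, 8.8, 9.9])
print(quantize_pkd(pkd))   # non-binder -> 0, binders spread over classes 1-10
```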
One consequence of this scheme is that increasing any objective value by one unit corresponds to moving up one decile in the empirical label distribution.

Training Discriminators on Noisy Inputs: the benefits of discretization are not limited to addressing class imbalance. Working with discretized labels also allowed a simple approach to training the discriminator on corrupted inputs, inspired by label smoothing [76]. We train the discriminators with the same noise schedule as the diffusion model and the usual cross-entropy loss, using modified labels
$$\tilde{y}_t = \alpha_t\, y + \frac{1 - \alpha_t}{K}\, \mathbf{1},$$
where $y$ is the one-hot encoded label and $K$ is the number of classes. Informally, as $\alpha_t \to 0$ the discriminator reverts to a uniform prior since the inputs are not distinguishable. Training on corrupted inputs avoids evaluating the value gradient on out-of-distribution inputs during generation, and causes the strength of the value gradient to grow as the diffusion progresses and the samples become more defined.

D.5 Baselining LaMBO-2 Against Unguided Sequence- and Structure-Based Diversification

Structure-Based Diversification: we have shown that we can effectively optimize antibodies for predicted yield and affinity, and our method performs well compared to unguided sequence-based infilling methods. We expand our evaluation for this task to include unguided infilling with DiffAb and RFDiffusion of CDRs H2 and H3 of hu4D5 (i.e. the seed), a publicly released therapeutic antibody that is ideally suited for structure-based methods since we have a ground-truth crystal structure of hu4D5 docked with its target ERBB2. While it is not feasible to validate the resulting designs in vitro during the author response period, we can compare the AntiBERTy naturalness scores and the acquisition value (log expected hypervolume improvement, or log-EHVI) of the designs relative to our guided infills (Fig. 15). To summarize, unguided structure-based infilling produces high-likelihood samples, but even when conditioned on the antigen the distribution shift toward better predicted function is very slight.

Figure 15: (left) We find that structure-based infills, particularly from DiffAb, tend to score consistently well on naturalness; guided infilling produces a much wider range of scores, but the mode is very close to that of RFDiffusion. (middle) Acquisition value as assessed by the same model used to guide towards higher yield and binding affinity; the guided infills have very high acquisition value, since they were explicitly optimized for that outcome. Given 1024 samples each, DiffAb failed to produce any sequences of higher expected value than the seed, and RFDiffusion produced only 7 marginally improved designs. (right) We also took the opportunity to assess the sensitivity of RFDiffusion to the antigen by comparing infills generated using the antibody structure only. While the effect is not large, antigen information does produce a small shift in the distribution of acquisition values to the right.

Sequence Diversification: this in silico evaluation compares two variants of LaMBO-2 (one using NOS-C, the other NOS-D) against a competing method, walk-jump sampling (WJS), an unguided smoothed discrete sampling algorithm proposed by Frey et al. [30]. Each method generated 1K designs from the same set of seeds, and all methods were restricted to B = 8 edits.
LaMBO-2 chose all edit positions automatically along the entire antibody sequence, whereas WJS was given manually selected edit positions restricted to CDRs only. In the left two panels of Figure 7 we compare the predicted expression yield, predicted binding affinity, and naturalness of the antibody designs, using the AntiBERTy naturalness metric. Comparing the Pareto frontiers obtained from each set of designs, we see that while WJS excels at generating natural antibodies, it struggles to generate designs at the higher end of the objective range. Conversely, LaMBO-2 designs (particularly those generated with NOS-C) have high predicted objective value but also lower naturalness scores. LaMBO-2 designs generated with NOS-D strike a balance between the two extremes.

Figure 16: We evaluate LaMBO-2 in the context of real-world antibody lead optimization (x-axis: naturalness; y-axes: predicted expression and predicted affinity; methods: seeds, NOS-C, NOS-D, WJS). LaMBO-2 can use either NOS-C or NOS-D to generate design libraries with higher predicted objective value than the unguided sampling baseline WJS [30]; however, intensive optimization comes at the cost of reduced naturalness (left and center panels).

D.6 Are Saliency Maps Reliable?

There is substantial controversy regarding the reliability of input-gradient-based feature attribution methods, specifically related to their ability to consistently highlight ground-truth task-discriminative features and ignore irrelevant features. For example, Hooker et al. [41] claim that random attribution is competitive with input-gradient methods, and Casper et al. [11] claim that gradient-free attribution outperforms input-gradient competitors. On the other hand, many papers claim that specific types of regularization can improve the performance of input-gradient attribution, including adversarial training [66], mask denoising [6], and model curvature penalties [72]. A thorough investigation of these claims is beyond the scope of this work; however, we have found that saliency maps produced by independent models trained with different corruption processes seem to consistently highlight specific regions of the antibody sequence (Figure 17).

Figure 17: Binding affinity feature attributions for hu4D5 produced by independent models trained with different input corruptions (panels: Gaussian input noise and input masking, shown over the aligned heavy and light chain sequences). While the attributions do not match exactly, there is substantial agreement on the importance of CDRH3 (top panel) and CDRL1. Some importance is also assigned to various framework regions, which could be related to the fitness of different antibody germlines. We emphasize that these models were trained solely on aligned sequences, with no additional positional information.

It is also worth noting that most of the related literature evaluates feature attribution in the offline setting. In LaMBO-2, feature attributions are used online to intervene on the data collection process (specifically, to choose where to introduce changes in the antibody sequences).
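As an illustration of how such attributions can drive online edit-position selection, the sketch below scores positions by the norm of the value gradient with respect to the token embeddings and samples a budget of positions from a tempered softmax over those scores; the module names and the softmax-sampling rule are assumptions for illustration, not the exact LaMBO-2 procedure.

```python
import torch
import torch.nn.functional as F

# Sketch of saliency-informed edit-position selection: score each position by
# the norm of the value-function gradient w.r.t. its token embedding, then
# sample an edit budget of positions from a tempered softmax over the scores.
# `embed` and `value_fn` are placeholder modules.

def select_edit_positions(tokens, embed, value_fn, budget=8, tau=0.1):
    x = embed(tokens)                                  # (L, d) token embeddings
    x.retain_grad()
    value_fn(x).sum().backward()
    saliency = x.grad.norm(dim=-1)                     # (L,) per-position scores
    probs = F.softmax(saliency / tau, dim=0)           # large tau -> near-uniform positions
    return torch.multinomial(probs, num_samples=budget, replacement=False)
```

A very large temperature makes the position distribution effectively uniform, while a small temperature concentrates edits on the most salient positions, consistent with the uniform versus saliency-informed settings reported in Appendix D.3.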
If LaMBO-2 changes a position that does not affect function, it is reasonable to conjecture that input-gradient attributions would adjust accordingly after the model is retrained for the next round. Further investigation into feature attribution in decision-making contexts (as opposed to post hoc interpretability) is an exciting direction for future work.

D.7 Wetlab Validation

Figure 18: Experimentally validated yield (top) for all expressing designs and affinity (bottom) for all binding designs as a function of edit distance from the original seed. In the right column we show the absolute measurement, and the left column shows the change relative to a seed measurement in the same batch.

In this section we briefly summarize the experimental procedures used to validate LaMBO-2 designs in vitro. Designed antibody sequences from LaMBO-2 were expressed and purified, and surface plasmon resonance (SPR) measurements were used to determine binding affinity. See Figure 18 for a plot of design binding affinity vs. edit distance from the seed antibody.

Plasmid Construction and Antibody Production: synthesized DNA of antibody variable domains (Twist Biosciences) was cloned into mammalian expression vectors using Gibson assembly. The whole vector was amplified using PrimeSTAR Max polymerase (Takeda). PCR products were transfected transiently into 1 mL Expi293 cell cultures. Expression lasted 7 days before harvest. Antibodies were affinity-purified over MabSelect SuRe resin (Cytiva), and their concentration was measured by optical density at 280 nm.

Binding Affinity Measurements: affinity of the antibodies towards their target antigen was measured by surface plasmon resonance (SPR) at 37 °C on a Biacore 8K instrument (Cytiva) in HBS-EP+ buffer (10 mM HEPES pH 7.4, 150 mM NaCl, 0.3 mM EDTA, and 0.05% vol/vol Surfactant P20). Antibodies were captured on a Protein A chip and their target antigens were injected for 5 minutes and allowed to dissociate for 10 minutes at 30 µL/min. The surface was regenerated between cycles with 10 mM glycine pH 1.5. Affinity constants were obtained using Biacore Insight (Cytiva) with a 1:1 binding kinetics model.