# Structure-informed Language Models Are Protein Designers

Zaixiang Zheng\*1, Yifan Deng\*2,1, Dongyu Xue1, Yi Zhou1, Fei Ye1, Quanquan Gu1

\*Equal contribution. 1ByteDance Research. 2Dept. of Computer Science, University of Wisconsin-Madison (work was done during Yifan's internship at ByteDance Research). Correspondence to: Zaixiang Zheng, Fei Ye. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

This paper demonstrates that language models are strong structure-based protein designers. We present LM-DESIGN, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into the pLM and endows it with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-DESIGN improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on the CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-DESIGN can (1) leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).

## 1. Introduction

Proteins are 3D-folded linear chains of amino acids that govern biological functions such as transcription, translation, signaling, and cell cycle control. Recently, the promise of learning to understand and design proteins from data via generative deep learning has led to an ongoing paradigm shift away from the long-established physics-based methods. Designing protein sequences that fold into desired structures, namely structure-based protein (sequence) design, is one of the most important problems in bio-engineering.

Significant progress has been made by several recent deep generative model-based approaches (Ingraham et al., 2019; Jing et al., 2020; Hsu et al., 2022; Dauparas et al., 2022; Gao et al., 2022). These approaches formulate structure-based protein design as an end-to-end graph-to-sequence learning problem, where an encoder-decoder model $M_\theta: X \to S$ is tasked with predicting a protein sequence $S$ given a protein backbone structure $X$. Typically, such models are trained with supervised learning on a certain amount of protein structure-sequence pair data.

Although deep generative models have shown revolutionary capability in this field, we argue that current neural structure-based protein design approaches are not necessarily at their best in designing plausible proteins, as two major obstacles remain and hinder further progress: (i) **Limited experimentally determined protein structure data.** For example, the known protein structures in the commonly used CATH (Orengo et al., 1997) dataset are multiple orders of magnitude fewer (<0.1%) than the sequences in the UniRef (Suzek et al., 2015) sequence database (Fig. 1D-E).
As structure-based protein design is essentially a conditional sequence learning problem, the protein sequence distribution is crucial yet remains elusive for generally data-hungry generative models due to limited data. As a result, they fail to holistically explore the protein sequence space and tend to yield sub-optimal sequence predictions for given folds. Although this can be partly remedied by a data-augmented approach (Hsu et al., 2022), the additional predicted structure data and trainable model parameters at scale incur considerable compute and storage overheads. (ii) **Challenge of structurally non-deterministic regions.** From a biological perspective, protein structures are sometimes not sufficiently informative, especially in flexible regions such as loops and exposed surfaces (Towse & Daggett, 2012). In these regions, residue identities are, hypothetically, less correlated with the structural context, whereas sequential knowledge is far more informative yet largely neglected. We verified this hypothesis and found that existing purely structure-based approaches are prone to produce functionally invalid sequences for these regions (Fig. 1A-B).

[Figure 1 panels: diagrams of the LM-DESIGN architecture (a structure encoder such as a GNN, ProteinMPNN, GVP, or IPA; a structural adapter; and a pLM sequence decoder such as the ESM series), and the relative data scales of UniRef-50 sequences (~5 x 10^7) versus CATH structures (~2 x 10^4).]

*Figure 1. Overview. (A) Case study of the Tyrosine kinase activation loop. The ribbon diagram shows the structure of Tyrosine kinase colored by AlphaFold2 pLDDT score; the activation loop is characterized by low pLDDT scores, suggesting flexible conformations. (B) Multiple sequence alignment of the activation loop, showing that this sequence is highly evolutionarily conserved; predictions from ProteinMPNN and LM-DESIGN are shown. (C) Preliminary study on the refinement ability of pLMs, where ESM-1b takes the predictions of ProteinMPNN as input. (D) Illustration of neural structure-based protein sequence design, and (E) of protein language models. (F) Overall illustration of LM-DESIGN, where the colored protein structure image is credited to RFDiffusion (Watson et al., 2022).*

Therefore, sequential information should be better utilized for structure-based protein design. Inspired by the impressive progress of large language models (LLMs) in natural language processing (NLP) (Devlin et al., 2019; Radford et al., 2018; Brown et al., 2020), recent literature in protein research has also demonstrated the emergent evolutionary knowledge of proteins captured by protein language models (pLMs; Rives et al., 2019; Lin et al., 2022; Hu et al., 2022), learned from the universe of massive protein sequence data. Such comprehensive and thorough sequential knowledge in pLMs can help probe functional properties and even predict protein structures from single sequences, without the need for explicit evolutionary homologs (e.g., MSAs). Thus, an exciting research question naturally arises:

*Since pLMs are such strong sequence learners, can we leverage pLMs to achieve better structure-based protein design?*
If so, rather than serving as protein sequence encoders, pLMs could be repurposed as sequence generative models (since they are trained to reconstruct corrupted protein sequences), prompted by a desired structure to generate sequences and thereby making the most of their acquired sequential evolutionary knowledge. How to best achieve this goal, however, is non-trivial and remains under-explored (we will discuss our preliminary attempts that use pLMs for post-editing, which provide insights motivating our proposal, in §2), and thus deserves to be comprehensively studied.

In this paper, we show that language models with structural surgery are strong protein designers without using abundant training data. We propose LM-DESIGN, a generic approach to reprogramming sequence-based protein language models (pLMs) to design protein sequences of a desired fold. As shown in Fig. 1F, we conduct a structural surgery on a pLM (e.g., ESM-1b), where a lightweight structural adapter is implanted to endow the pLM with structural awareness through access to an arbitrary additional structure encoder (e.g., ProteinMPNN). During inference, iterative refinement is performed to optimize the generated protein sequence until convergence, i.e., when the prediction can no longer be improved. We highlight our contributions and findings as follows:

- We introduce LM-DESIGN, a generic approach that transforms pLMs into protein design models via structural surgery. LM-DESIGN yields preferable protein sequences for desired structures while being model-agnostic, modularizable, and parameter- and data-efficient.
- Experiments show that LM-DESIGN advances the state-of-the-art methods by a large margin, achieving 55.65% and 56.76% sequence recovery on CATH 4.2 and CATH 4.3 for single-chain proteins, and >60% for protein complexes. LM-DESIGN can also be combined with data augmentation (Hsu et al., 2022), where large amounts of additional protein structures predicted by AlphaFold2 (Jumper et al., 2021) are leveraged.
- In particular, we find that LM-DESIGN can accurately handle structurally non-deterministic regions (e.g., functional loops and exposed surfaces) thanks to the sequence knowledge learned by pLMs, whereas previous methods typically fail. We also find that LM-DESIGN is structurally sensitive, thereby better determining the nuanced sequential specificity of protein groups with high structural similarity. We also show that LM-DESIGN can synthesize diverse and structurally valid sequences.
- We further evaluate the zero-shot generalizability of LM-DESIGN in designing proteins of unseen categories, including antibodies and de novo proteins, and observe superb performance.

We emphasize that the goal of this study is not to compete with but rather to complement current neural structure-based sequence design models. We hope that LM-DESIGN can become a powerful, universal, and easy-to-use wrapper that helps integrate the advances of both protein sequence learning (e.g., pLMs) and structure learning (e.g., geometric/graph NNs and protein structure prediction), facilitating future protein research.
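To make the recipe above concrete, below is a minimal PyTorch sketch of the two key ingredients described in this section: a lightweight structural adapter that lets pLM token representations cross-attend to features from an external structure encoder, and an iterative refinement loop that re-predicts the sequence until it stops changing. It is an illustrative simplification rather than the released LM-DESIGN implementation; all names (e.g., `StructuralAdapter`, `plm`, `lm_head`, `h_struct`) are hypothetical placeholders, and many details of the actual method are omitted.

```python
# Illustrative sketch of a structural adapter plus iterative refinement
# (hypothetical names and shapes; not the official LM-DESIGN code).
import torch
import torch.nn as nn


class StructuralAdapter(nn.Module):
    """Lightweight adapter: pLM token states cross-attend to structure features."""

    def __init__(self, plm_dim: int, struct_dim: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            plm_dim, n_heads, kdim=struct_dim, vdim=struct_dim, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(plm_dim, 4 * plm_dim), nn.GELU(), nn.Linear(4 * plm_dim, plm_dim)
        )
        self.norm1 = nn.LayerNorm(plm_dim)
        self.norm2 = nn.LayerNorm(plm_dim)

    def forward(self, h_plm, h_struct):
        # h_plm: [B, L, plm_dim] token states from a (frozen) pLM
        # h_struct: [B, L, struct_dim] per-residue features from a structure encoder
        attn_out, _ = self.cross_attn(self.norm1(h_plm), h_struct, h_struct)
        h = h_plm + attn_out                       # residual connection around attention
        return h + self.ffn(self.norm2(h))         # residual connection around the FFN


@torch.no_grad()
def iterative_refinement(plm, adapter, lm_head, h_struct, seq_tokens, n_iters=5):
    """Re-predict every residue from the current draft until the prediction stops changing."""
    for _ in range(n_iters):
        h_plm = plm(seq_tokens)                      # frozen pLM re-encodes the current sequence
        logits = lm_head(adapter(h_plm, h_struct))   # structure-aware per-residue logits
        new_tokens = logits.argmax(dim=-1)
        if torch.equal(new_tokens, seq_tokens):      # converged: prediction unchanged
            break
        seq_tokens = new_tokens
    return seq_tokens
```

In a setup like this, only the adapter (and possibly the output head) would be trained while the pLM stays frozen, in line with the parameter- and data-efficiency emphasized above.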
## 2. Preliminaries

### 2.1. Structure-based Protein (Sequence) Design

The structure-based sequence design problem (a.k.a. protein inverse folding) is to find, given a protein backbone structure of interest, an amino acid sequence that will fold into this structure (Dauparas et al., 2022). While physics-based approaches tackle sequence design as an energy minimization problem (Dahiyat & Mayo, 1997; Street & Mayo, 1999), as in Rosetta (Alford et al., 2017), recent advances in deep learning (DL) have demonstrated great promise in generating plausible amino acid sequences for desired protein structures (Ingraham et al., 2019; Hsu et al., 2022).

**Problem Formulation.** Neural structure-based protein design can be formulated as an end-to-end graph-to-sequence learning problem. Formally, a parameterized encoder-decoder neural model $M_\theta$ is tasked with predicting the protein sequence for a protein backbone structure. For a protein of length $L$, $S = \{s_i \in \mathrm{Cat}(20) \mid 1 \le i \le L\}$ is a residue sequence over the 20 amino acid types, and $X = \{x_i \in \mathbb{R}^{N_{\mathrm{atoms}} \times 3} \mid 1 \le i \le L\}$ denotes the 3D spatial coordinates of the residues of the desired protein structure, with $N_{\mathrm{atoms}}$ backbone atoms per residue (e.g., N, C$_\alpha$ and C, optionally O). The learning objective is to find the model parameters $\theta$ that maximize the conditional log-likelihood $\log p(S \mid X; \theta)$ given sufficient protein structure-sequence paired data. This enables us to design sequences of maximum likelihood, or to use sampling algorithms when the diversity and novelty of designs are taken into account.

**Overview.** The general workflow of these approaches (Fig. 1D) is as follows: (1) a desired protein backbone structure $X$ is first represented as a k-nearest-neighbor (kNN) graph in 3D space, with geometric features attached to the nodes and edges of the graph; (2) a graph neural network-based encoder then takes the featurized graph as input and maps it to structural representations; and (3) finally, a sequence decoder consumes the encoded structural representations and predicts a sequence of amino acids $S$ that is expected to fold into the target protein structure $X$, for which an autoregressive decomposition $p(S \mid X) = \prod_{i=1}^{L} p(s_i \mid S_{<i}, X)$ is typically adopted.
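To ground this workflow, here is a deliberately simplified sketch of the graph-to-sequence pipeline: kNN graph construction over C-alpha coordinates, a placeholder structure encoder, and a toy autoregressive decoder trained by maximizing $\log p(S \mid X; \theta)$ (equivalently, minimizing per-residue cross-entropy). It follows the generic formulation above under simplifying assumptions (e.g., a GRU stands in for the structure-conditioned Transformer or MPNN decoders used in practice), and names such as `knn_graph` and `InverseFoldingModel` are hypothetical rather than any specific published implementation.

```python
# Simplified sketch of the generic graph-to-sequence workflow
# (hypothetical names, single protein, no batching).
import torch
import torch.nn as nn
import torch.nn.functional as F


def knn_graph(coords, k: int = 30):
    """Build a k-nearest-neighbor graph over C-alpha coordinates.

    coords: [L, 3] -> edge_index: [2, L * k] of (source, neighbor) pairs.
    """
    dist = torch.cdist(coords, coords)                     # [L, L] pairwise distances
    nbrs = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop the self-neighbor
    src = torch.arange(coords.size(0)).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])


class InverseFoldingModel(nn.Module):
    """Schematic encoder-decoder M_theta: structure X -> sequence S."""

    def __init__(self, structure_encoder: nn.Module, hidden: int, n_tokens: int = 20):
        super().__init__()
        self.encoder = structure_encoder        # e.g., a GNN over the kNN graph
        self.embed = nn.Embedding(n_tokens, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, coords, seq):
        edge_index = knn_graph(coords)
        h_struct = self.encoder(coords, edge_index)     # [L, hidden] per-residue features
        # Teacher-forced autoregressive decoding: position i sees s_<i and the structure.
        prev = torch.cat([seq.new_zeros(1), seq[:-1]])  # right-shift; index 0 acts as a start token
        h_seq, _ = self.decoder((self.embed(prev) + h_struct).unsqueeze(0))
        return self.out(h_seq.squeeze(0))               # [L, 20] amino-acid logits

    def loss(self, coords, seq):
        # Maximizing log p(S | X; theta) == minimizing per-residue cross-entropy.
        return F.cross_entropy(self.forward(coords, seq), seq)
```

A real system would additionally featurize edges with distances and orientations, use invariant or equivariant message passing in the encoder, and decode with a structure-conditioned Transformer, but the factorization and training objective match the formulation above.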