# Generative models for graph-based protein design

John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola
Computer Science and Artificial Intelligence Lab, MIT
{ingraham, vgarg, regina, tommi}@csail.mit.edu

## Abstract

Engineered proteins offer the potential to solve many problems in biomedicine, energy, and materials science, but creating designs that succeed is difficult in practice. A significant aspect of this challenge is the complex coupling between protein sequence and 3D structure, with the task of finding a viable design often referred to as the inverse protein folding problem. In this work, we introduce a conditional generative model for protein sequences given 3D structures based on graph representations. Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. This graph-based approach improves in both speed and reliability over conventional and other neural network-based approaches, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.

## 1 Introduction

A central goal for computational protein design is to automate the invention of protein molecules with defined structural and functional properties. This field has seen tremendous progress in the past two decades [1], including the design of novel 3D folds [2], enzymes [3], and complexes [4]. Despite these successes, current approaches are often unreliable, requiring multiple rounds of trial-and-error in which initial designs often fail [5, 6]. Moreover, diagnosing the origin of this unreliability is difficult, as contemporary bottom-up approaches depend both on the accuracy of complex, composite energy functions for protein physics and on the efficiency of sampling algorithms for jointly exploring protein sequence and structure space.

Here, we explore an alternative, top-down framework for protein design that directly learns a conditional generative model for protein sequences given a specification of the target structure, which is represented as a graph over the residues (amino acids). Specifically, we augment the autoregressive self-attention of recent sequence models [7] with graph-based representations of 3D molecular structure. By composing multiple layers of this structured self-attention, our model can effectively capture higher-order, interaction-based dependencies between sequence and structure, in contrast to previous parametric approaches [8, 9] that are limited to first-order effects.

A graph-structured sequence model offers several benefits, including favorable computational efficiency, inductive bias, and representational flexibility. We accomplish the first two by leveraging a well-evidenced finding in protein science, namely that long-range dependencies in sequence are generally short-range in 3D space [10-12]. By making the graph and self-attention similarly sparse and localized in 3D space, we achieve computational scaling that is linear in sequence length. Additionally, graph-structured inputs offer representational flexibility, as they accommodate both coarse, flexible-backbone descriptions of structure (connectivity and topology) and fine-grained ones (precise atom locations).

We demonstrate the merits of our approach via a detailed empirical study.
Specifically, we evaluate our model's performance for structural generalization to sequences of protein 3D folds that are topologically distinct from those in the training set. For fixed-backbone sequence design, we find that our model achieves considerably improved statistical performance over a prior neural network-based model and that it attains higher accuracy and efficiency than Rosetta fixbb, a state-of-the-art program for protein design.

The rest of the paper is organized as follows. We first position our contributions with respect to prior work in Section 1.1. We provide details on our methods, including structure representation, in Section 2, and introduce our Structured Transformer model in Section 2.2. The details of our experiments are laid out in Section 3, and the corresponding results that elucidate the merits of our approach are presented in Section 4.

### 1.1 Related Work

**Generative models for protein sequence and structure** A number of works have explored the use of generative models for protein engineering and design [13]. [8, 9, 14] used neural network-based models for sequences given 3D structure, where the amino acids are modeled independently of one another. [15] introduced a generative model for protein sequences conditioned on a 1D, context-free grammar based specification of the fold topology. Multiple works [16, 17] have modeled the conditional distribution of single amino acids given surrounding structure and sequence context with convolutional neural networks. In contrast to these works, our model captures the joint distribution of the full protein sequence while grounding these dependencies in terms of long-range interactions arising from structure.

In parallel to the development of structure-based models, there has been considerable work on deep generative models for protein sequences in individual protein families [18-21]. While useful, these methods presume the availability of a large number of sequences from a particular family, which are unavailable in the case of designing novel proteins that diverge significantly from natural sequences. Several groups have obtained promising results using unconditional protein language models [22-25] to learn sequence representations that transfer well to supervised tasks. While serving different purposes, we emphasize that one advantage of conditional generative modeling is to facilitate adaptation to specific (and potentially novel) parts of structure space. Language models trained on hundreds of millions of evolutionary sequences will still be semantically bottlenecked by the thousands of 3D evolutionary folds that those sequences represent. We propose evaluating protein language models with structure-based splitting of sequence data, and begin to see how unconditional language models may struggle to assign high likelihoods to sequences from out-of-training folds. In a complementary line of research, several deep and differentiable parameterizations of protein structure [26-29] have recently been proposed that could be used to generate, optimize, or validate 3D structures for input to sequence design.

**Protein design and interaction graphs** For classical approaches to computational protein design, which are based on joint modeling of structure and sequence, we refer the reader to a review of both methods and accomplishments in [1].
Many of the major firsts in protein design are due to Rosetta [30, 31], a leading framework for protein design. More recently, there have been successes with non-parametric approaches to protein design [32], which are based on finding substructural homologies between the target and diverse templates in a large protein database. In this work, we focus on comparisons to Rosetta (Section 4.2), since it is based on a single parametric energy function for capturing the sequence-structure relationship.

**Self-Attention** Our model extends the Transformer [33] to capture sparse, pairwise relational information between sequence elements. The dense variation of this problem was explored in [34] and [35]. As noted in those works, incorporating general pairwise information incurs $O(N^2)$ memory (and computational) cost for sequences of length $N$, which can be highly limiting for training on GPUs. We circumvent this cost by instead restricting the self-attention to the sparsity of the input graph. Given this graph-structured self-attention, our model may also be reasonably cast in the framework of message-passing or graph neural networks [36, 37] (Section 4.1). Our approach is similar to Graph Attention Networks [38], but augmented with edge features and an autoregressive decoder.

Figure 1: A graph-based, autoregressive model for protein sequences given 3D structures. (A) We cast protein design as language modeling conditioned on an input graph. In our architecture, an encoder develops a sequence-independent representation of 3D structure via multi-head self-attention [7] on the spatial k-nearest neighbors graph. A decoder then autoregressively generates each amino acid $s_i$ given the full structure and previously decoded amino acids. (B) Each layer of the encoder and decoder contains a step of neighborhood aggregation (self-attention) and of local information processing (position-wise feedforward).

In this work, we introduce a Structured Transformer model that draws inspiration from the self-attention based Transformer model [7] and is augmented for scalable incorporation of relational information (Figure 1). While general relational attention incurs quadratic memory and computation costs, we avert these by restricting the attention for each node $i$ to the set $N(i, k)$ of its $k$-nearest neighbors in 3D space. Since our architecture is multilayered, iterated local attention can derive progressively more global estimates of context for each node $i$. Second, unlike the standard Transformer, we also include edge features to embed the spatial and positional dependencies in deriving the attention. Thus, our model generalizes the Transformer to spatially structured settings.

### 2.1 Representing structure as a graph

We represent protein structure in terms of an attributed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node features $\mathcal{V} = \{v_1, \ldots, v_N\}$ describing each residue (amino acid; the letters that compose a protein sequence) and edge features $\mathcal{E} = \{e_{ij}\}_{i \neq j}$ capturing relationships between them.
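To make this construction concrete, the following is a minimal sketch, not the authors' released code, of building such a k-nearest-neighbors residue graph from per-residue coordinates with NumPy and SciPy; the choice of a single Cα-like representative point per residue and the helper names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of building a
# k-nearest-neighbors residue graph G = (V, E) from per-residue coordinates.
import numpy as np
from scipy.spatial import cKDTree

def knn_residue_graph(coords: np.ndarray, k: int = 30):
    """coords: (N, 3) array with one representative coordinate per residue.

    Returns an (N, k) array of neighbor indices and an (N, k) array of
    pairwise distances, defining the sparse edge set {e_ij : j in N(i, k)}.
    """
    tree = cKDTree(coords)
    # Query k+1 neighbors because the nearest neighbor of each point is itself.
    dists, idx = tree.query(coords, k=k + 1)
    return idx[:, 1:], dists[:, 1:]

if __name__ == "__main__":
    N = 100  # hypothetical chain length
    coords = np.cumsum(np.random.randn(N, 3), axis=0)  # toy "backbone" trace
    neighbors, distances = knn_residue_graph(coords, k=30)
    print(neighbors.shape, distances.shape)  # (100, 30) (100, 30)
```

Restricting all downstream computation to these k neighbor lists is what keeps the cost linear in sequence length.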
This formulation can accommodate different variations on the macromolecular design problem, including both rigid backbone design, where the precise coordinates of backbone atoms are fixed, and flexible backbone design, where softer constraints such as blueprints of hydrogen-bonding connectivity [5] or 1D architectures [15] could define the structure of interest.

**3D considerations** For a rigid-body design problem, the structure for conditioning is a fixed set of backbone coordinates $X = \{x_i \in \mathbb{R}^3 : 1 \le i \le N\}$, where $N$ is the number of positions (here we consider a single representative coordinate per position when deriving edge features, but may revisit multiple atom types per position for features such as backbone angles or hydrogen bonds). We desire a graph representation of the coordinates $G(X)$ that has two properties:

- **Invariance.** The features are invariant to rotations and translations.
- **Locally informative.** The edge features incident to $v_i$ due to its neighbors $N(i, k)$, i.e. $\{e_{ij}\}_{j \in N(i,k)}$, contain sufficient information to reconstruct all adjacent coordinates $\{x_j\}_{j \in N(i,k)}$ up to rigid-body motion.

While invariance is motivated by standard symmetry considerations, the second property is motivated by limitations of current graph neural networks [36]. In these networks, updates to node features $v_i$ depend only on the edge and node features adjacent to $v_i$. However, these features are typically insufficient to reconstruct the relative neighborhood positions $\{x_j\}_{j \in N(i,k)}$, so individual updates cannot fully depend on the "local environment". For example, when reasoning about the neighborhood around coordinate $x_i$, the pairwise distances $D_{ia}$ and $D_{ib}$ are insufficient to determine whether $x_a$ and $x_b$ lie on the same or opposite sides.

Figure 2: Spatial features capture structural relationships across diverse folds. (A) The edge features of our most detailed protein graph representation capture the relative distance, direction, and orientation between two positions on the backbone. For scalability, all computation after an initially dense Euclidean distance calculation (right, top), such as relative directions (right, bottom) and neural steps, can be restricted to the k-nearest neighbors graph. (B) Example of topological variation in the dataset. Protein chains in train, test, and validation are split by the sub-chain CATH [40] topologies, which means that folds in each set will have distinct patterns of spatial connectivity.

**Relative spatial encodings** We develop invariant and locally informative features by first augmenting the points $x_i$ with orientations $O_i$ that define a local coordinate system at each point (Figure 2). We define these in terms of the backbone geometry as
$$O_i = [\, b_i \;\; n_i \;\; b_i \times n_i \,],$$
where $b_i$ is the negative bisector of the angle between the rays $(x_{i-1} - x_i)$ and $(x_{i+1} - x_i)$, and $n_i$ is a unit vector normal to that plane. Formally,
$$u_i = \frac{x_i - x_{i-1}}{\lVert x_i - x_{i-1} \rVert}, \qquad b_i = \frac{u_i - u_{i+1}}{\lVert u_i - u_{i+1} \rVert}, \qquad n_i = \frac{u_i \times u_{i+1}}{\lVert u_i \times u_{i+1} \rVert}.$$
Finally, we derive the spatial edge features $e_{ij}^{(s)}$ from the rigid-body transformation that relates reference frame $(x_i, O_i)$ to reference frame $(x_j, O_j)$. While this transformation has 6 degrees of freedom, we decompose it into features for distance, direction, and orientation as
$$e_{ij}^{(s)} = \left( r\!\left(\lVert x_j - x_i \rVert\right),\; O_i^{\top} \frac{x_j - x_i}{\lVert x_j - x_i \rVert},\; q\!\left(O_i^{\top} O_j\right) \right).$$
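As a concrete reference, here is a minimal NumPy sketch of how such frames and raw edge features could be computed from the formulas above; the function names, the padding of chain termini, and the omission of the radial basis and quaternion encodings are illustrative simplifications, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's released code) of the local
# frames O_i = [b_i, n_i, b_i x n_i] and the raw spatial edge features.
import numpy as np

def _unit(v, eps=1e-8):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def local_frames(x):
    """x: (N, 3) representative backbone coordinates. Returns (N, 3, 3) frames;
    terminal frames are copied from the nearest interior position for convenience."""
    u = _unit(x[1:] - x[:-1])          # u_i = (x_i - x_{i-1}) / ||x_i - x_{i-1}||
    u_i, u_next = u[:-1], u[1:]        # aligned at interior positions
    b = _unit(u_i - u_next)            # negative bisector
    n = _unit(np.cross(u_i, u_next))   # normal to the bisecting plane
    O = np.stack([b, n, np.cross(b, n)], axis=-1)       # columns b, n, b x n
    return np.concatenate([O[:1], O, O[-1:]], axis=0)   # pad chain termini

def spatial_edge_features(x, O, i, j):
    """Distance, direction in frame i, and relative rotation O_i^T O_j.
    The paper additionally lifts the distance into a radial basis r(.) and
    encodes the rotation as a quaternion q(.); both steps are omitted here."""
    d = x[j] - x[i]
    dist = np.linalg.norm(d)
    direction = O[i].T @ (d / dist)
    rotation = O[i].T @ O[j]           # would be converted to a quaternion
    return dist, direction, rotation
```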
In $e_{ij}^{(s)}$, the first term is a distance encoding $r(\cdot)$ lifted into a radial basis (we used 16 Gaussian RBFs isotropically spaced from 0 to 20 Angstroms), the second is a direction encoding corresponding to the relative direction of $x_j$ in the reference frame $(x_i, O_i)$, and the third is an orientation encoding $q(\cdot)$ of the quaternion representation of the spatial rotation matrix $O_i^{\top} O_j$. Quaternions represent 3D rotations as four-element vectors that can be efficiently and reasonably compared by inner products [39]; we represent them in terms of their vector of real coefficients.

**Relative positional encodings** As in the original Transformer, we also represent distances between residues in the sequence (rather than in space) with positional embeddings $e_{ij}^{(p)}$. Specifically, we need to represent the positioning of each neighbor $j$ relative to the node under consideration $i$. We therefore obtain the position embedding as a sinusoidal function of the gap $i - j$. We retain the sign of $i - j$ because protein sequences are generally asymmetric. These relative encodings contrast with the absolute encodings of the original Transformer, but are consistent with modifications described in subsequent work [34].

**Node and edge features** Finally, we obtain an aggregate edge encoding vector $e_{ij}$ by concatenating the structural encodings $e_{ij}^{(s)}$ with the positional encodings $e_{ij}^{(p)}$ and then linearly transforming them to the same dimension as the model. We only include edges in the k-nearest neighbors graph of $X$, with $k = 30$ for all experiments. This $k$ is generous, as typical definitions of residue-residue contacts in proteins result in fewer than 20 contacts per residue. For node features, we compute the three dihedral angles of the protein backbone $(\phi_i, \psi_i, \omega_i)$ and embed these on the 3-torus as $\{\sin, \cos\} \times (\phi_i, \psi_i, \omega_i)$.

**Flexible backbone features** We also consider flexible backbone descriptions of 3D structure based on topological binary edge features and coarse backbone geometry. We combine the relative positional encodings with two binary edge features: contacts, which indicate when the distance between the Cα atoms at positions $i$ and $j$ is less than 8 Angstroms, and hydrogen bonds, which are directed and defined by the electrostatic model of DSSP [41]. For coarse node features, we compute virtual dihedral and bond angles between backbone Cα atoms, interpret them as spherical coordinates, and represent them as points on the unit sphere.

### 2.2 Structured Transformer

**Autoregressive decomposition** We decompose the distribution of a protein sequence $s$ given a 3D structure $x$ autoregressively as
$$p(s \mid x) = \prod_i p(s_i \mid x, s_{<i}).$$
In the decoder, the information that position $i$ attends to at node $j$ is the concatenation
$$\begin{cases} \left(h_j^{(\mathrm{dec})},\, e_{ij},\, g(s_j)\right) & j < i \\ \left(h_j^{(\mathrm{enc})},\, e_{ij},\, \mathbf{0}\right) & j \ge i. \end{cases}$$
Here $h_j^{(\mathrm{dec})}$ is the embedding of node $j$ in the current layer of the decoder, $h_j^{(\mathrm{enc})}$ is the embedding of node $j$ in the final layer of the encoder, and $g(s_j)$ is a sequence embedding of amino acid $s_j$ at node $j$. This concatenation and masking structure ensures that sequence information only flows to position $i$ from positions $j < i$, but still allows position $i$ to attend to subsequent structural information, unlike the standard Transformer decoder.

We now demonstrate the merits of our approach via a detailed empirical analysis. We begin with the experimental setup, including our architecture and a description of the data used in our experiments.

**Architecture** In all experiments, we used three layers of self-attention and position-wise feedforward modules for the encoder and decoder, with a hidden dimension of 128.
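To illustrate the shape of one such layer (neighborhood self-attention over the k-nearest-neighbors graph followed by a position-wise feedforward step), here is a single-head NumPy sketch; the single head, the omission of layer normalization and dropout, and the random toy weights are simplifications, and none of this is the authors' implementation.

```python
# Minimal sketch of one structured self-attention layer: each node i attends
# only to its k nearest neighbors, with edge features concatenated into keys
# and values, followed by a position-wise feedforward step.
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def structured_attention_layer(h, e, nbr, Wq, Wk, Wv, W1, W2):
    """h: (N, d) node embeddings, e: (N, k, d) edge embeddings,
    nbr: (N, k) neighbor indices, W*: weight matrices (biases omitted)."""
    h_nbr = h[nbr]                                  # (N, k, d) neighbor embeddings
    rel = np.concatenate([h_nbr, e], axis=-1)       # (N, k, 2d) relational information
    q = h @ Wq                                      # (N, d) query per node
    K = rel @ Wk                                    # (N, k, d) key per neighbor
    V = rel @ Wv                                    # (N, k, d) value per neighbor
    att = softmax(np.einsum('nd,nkd->nk', q, K) / np.sqrt(q.shape[-1]))
    h = h + np.einsum('nk,nkd->nd', att, V)         # neighborhood aggregation
    h = h + np.maximum(h @ W1, 0.0) @ W2            # position-wise feedforward (ReLU)
    return h

# Toy usage with random weights (d = 128, k = 30 as in the paper).
rng = np.random.default_rng(0)
N, k, d = 50, 30, 128
h = rng.normal(size=(N, d)); e = rng.normal(size=(N, k, d))
nbr = np.stack([rng.choice(N, size=k, replace=False) for _ in range(N)])
Wq = rng.normal(size=(d, d)) * 0.05; Wk = rng.normal(size=(2 * d, d)) * 0.05
Wv = rng.normal(size=(2 * d, d)) * 0.05
W1 = rng.normal(size=(d, 4 * d)) * 0.05; W2 = rng.normal(size=(4 * d, d)) * 0.05
print(structured_attention_layer(h, e, nbr, Wq, Wk, Wv, W1, W2).shape)  # (50, 128)
```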
**Optimization** We trained models using the learning rate schedule and initialization of the original Transformer paper [7], a dropout rate of 10% [42], a label smoothing rate of 10%, and early stopping based on validation perplexity. The unconditional language models did not include dropout or label smoothing.

**Dataset** To evaluate the ability of our models to generalize across different protein folds, we collected a dataset based on the CATH hierarchical classification of protein structure [40]. For all domains in the CATH 4.2 40% non-redundant set of proteins, we obtained full chains up to length 500 and then randomly assigned their CATH topology classifications (CAT codes) to train, validation, and test sets at a targeted 80/10/10 split. Since each chain can contain multiple CAT codes, we first removed any redundant entries from train and then from validation. Finally, we removed any chains from the test set that had CAT overlap with train and removed chains from the validation set with CAT overlap to train or test. This resulted in a dataset of 18,024 chains in the training set, 608 chains in the validation set, and 1,120 chains in the test set. There is zero CAT overlap between these sets.

A challenge in evaluating computational protein design methods is the degeneracy of the relationship between protein structure and sequence. Many protein sequences may reasonably design the same 3D structure [43], meaning that sequence similarity need not necessarily be high. At the same time, single mutations may cause a protein to break or misfold, meaning that high sequence similarity isn't sufficient for a correct design. To deal with this, we focus on three kinds of evaluation: (i) likelihood-based, where we test the ability of the generative model to assign high likelihood to held-out sequences; (ii) native sequence recovery, where we evaluate generated sequences against the native sequences of templates; and (iii) experimental comparison, where we compare the likelihoods of the model to high-throughput data from a de novo protein design experiment. We find that our model is able to attain considerably improved statistical performance in its likelihoods while simultaneously providing more accurate and efficient sequence recovery.

Table 2: Per-residue perplexities for protein language modeling (lower is better). The protein chains have been cluster-split by CATH topology, such that the test set includes only unseen 3D folds. While a structure-conditioned language model can generalize in this structure-split setting, unconditional language models struggle.

| Test set | Short | Single chain | All |
|---|---|---|---|
| **Structure-conditioned models** | | | |
| Structured Transformer (ours) | 8.54 | 9.03 | 6.85 |
| SPIN2 [8] | 12.11 | 12.61 | - |
| **Language models** | | | |
| LSTM (h = 128) | 16.06 | 16.38 | 17.13 |
| LSTM (h = 256) | 16.08 | 16.37 | 17.12 |
| LSTM (h = 512) | 15.98 | 16.38 | 17.13 |
| Test set size | 94 | 103 | 1120 |

### 4.1 Statistical comparison to likelihood-based models

**Protein perplexities** What kind of perplexities might be useful? To provide context, we first present perplexities for some simple models of protein sequences in Table 1. The amino acid alphabet and its natural frequencies upper-bound perplexity at 20 and 17.8, respectively. Random protein sequences under these null models are unlikely to be functional without further selection [44]. First-order profiles of protein sequences, such as those from the Pfam database [45], are nevertheless widely used for protein engineering. We found the average perplexity per letter of profiles in Pfam 32 (ignoring alignment uncertainty) to be 11.6.
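For reference, per-residue perplexity is the exponential of the average negative log-likelihood per position, so a uniform distribution over the 20 amino acids gives exactly 20 and the natural amino-acid frequencies give roughly 17.8, matching the null-model bounds above. A minimal sketch of the computation:

```python
# Minimal sketch of per-residue perplexity: exp of the mean negative
# log-likelihood that a model assigns to the true amino acid at each position.
import numpy as np

def perplexity(log_probs: np.ndarray) -> float:
    """log_probs: 1-D array of log p(s_i | context) for every residue
    in the evaluation set (natural log)."""
    return float(np.exp(-np.mean(log_probs)))

# A uniform model over the 20-letter amino acid alphabet has perplexity 20.
uniform_log_probs = np.full(1000, np.log(1.0 / 20.0))
print(perplexity(uniform_log_probs))  # 20.0
```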
The Pfam figure suggests that even models with perplexities as high as 11 have the potential to be useful for modeling the space of functional protein sequences.

**The importance of structure** We found a significant gap between unconditional language models of protein sequences and models conditioned on structure. Remarkably, for a range of structure-independent language models, the typical test perplexities are 16-17 (Table 2), barely better than the null letter frequencies (Table 1). We emphasize that the RNNs were not broken and could still learn the training set in these capacity ranges. All structure-based models had (unsurprisingly) considerably lower perplexities. In particular, our Structured Transformer model attained a perplexity of 7 on the full test set. It seems that protein language models trained on one subset of 3D folds (in our cluster-splitting procedure) generalize poorly to predict the sequences of unseen folds. We believe this possibility might be important to consider when training protein language models for protein engineering and design.

Table 3: Ablation of graph features and model components. Test perplexities (lower is better).

| Node features | Edge features | Aggregation | Short | Single chain | All |
|---|---|---|---|---|---|
| *Rigid backbone* | | | | | |
| Dihedrals | Distances, Orientations | Attention | 8.54 | 9.03 | 6.85 |
| Dihedrals | Distances, Orientations | Pair MLP | 8.33 | 8.86 | 6.55 |
| Cα angles | Distances, Orientations | Attention | 9.16 | 9.37 | 7.83 |
| Dihedrals | Distances | Attention | 9.11 | 9.63 | 7.87 |
| *Flexible backbone* | | | | | |
| Cα angles | Contacts, Hydrogen bonds | Attention | 11.71 | 11.81 | 11.51 |

Table 4: Improved reliability and speed compared to Rosetta. (a) On the single chain test set, our model more accurately recovers native sequences than Rosetta fixbb with greater speed (CPU: single core of Intel Xeon Gold 5115; GPU: NVIDIA RTX 2080). This set includes NMR-based structures for which Rosetta is known to not be robust [46]. (b) Our model also performs favorably on a prior benchmark of 40 proteins. All results are reported as the median of the average over 100 designs.

(a) Single chain test set (103 proteins)

| Method | Recovery (%) | Speed (AA/s), CPU | Speed (AA/s), GPU |
|---|---|---|---|
| Rosetta 3.10 fixbb | 17.9 | $4.88 \times 10^{-1}$ | N/A |
| Ours (T = 0.1) | 27.6 | $2.22 \times 10^{2}$ | $1.04 \times 10^{4}$ |

(b) Ollikainen benchmark (40 proteins)

| Method | Recovery (%) |
|---|---|
| Rosetta, fixbb 1 | 33.1 |
| Rosetta, fixbb 2 | 38.4 |
| Ours (T = 0.1) | 39.2 |

**Improvement over deep profile-based methods** We also compared to SPIN2, a recent method that uses deep neural networks to predict protein sequence profiles given protein structures [8]. Since SPIN2 is computationally intensive (minutes per protein, even for small proteins) and was trained on complete proteins rather than chains, we evaluated it on two subsets of the full test set: a Short subset containing chains up to length 100, and a Single chain subset containing only those entries where a single chain accounted for the entire protein record in the Protein Data Bank. Both subsets discarded any chains with structural gaps (chain breaks). We found that our Structured Transformer model significantly improved upon the perplexities of SPIN2 (Table 2).

**Graph representations and attention mechanisms** The graph-based formulation of protein design can accommodate very different formulations of the problem, depending on how structure is represented by a graph. We tested different approaches for representing the protein, including both more rigid design with precise geometric details and flexible, topological design based on spatial contacts and hydrogen bonding (Table 3).
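To make the coarsest of these representations concrete, the contact-based binary edge feature described in Section 2.1 (Cα-Cα distance below 8 Angstroms) can be sketched as follows; the hydrogen-bond feature would require a DSSP-style electrostatic calculation and is omitted, and the function below is an illustrative assumption rather than the authors' code.

```python
# Minimal sketch of the binary "contact" edge feature used in the flexible
# backbone representation: 1 if two Calpha atoms are within 8 Angstroms.
import numpy as np

def contact_edges(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """ca_coords: (N, 3) Calpha coordinates. Returns an (N, N) 0/1 contact map."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    contacts = (dist < cutoff).astype(np.int8)
    np.fill_diagonal(contacts, 0)  # ignore self-contacts
    return contacts
```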
For the best perplexities, we found that using local orientation information was indeed important beyond simple distance measures. At the same time, even the topological features were sufficient to obtain better perplexities than SPIN2 (Table 2), which uses precise atomic details. In addition to varying the graph features, we also experimented with an alternative aggregation function from message passing neural networks [36]. We found that a simple aggregation function
$$h_i = \sum_j \mathrm{MLP}\!\left(h_i, h_j, e_{ij}\right)$$
led to the best performance of all models, where $\mathrm{MLP}(\cdot)$ is a two-layer perceptron that preserves the hidden dimension of the model. We speculate that this is due to potential overfitting by the attention mechanism. Although this suggests room for future improvements, we use multi-head self-attention throughout the remaining experiments.

### 4.2 Benchmarking protein redesign

**Decoding strategies** Generating protein sequence designs requires a sampling scheme for drawing high-likelihood sequences from the model. While beam search and top-k sampling [47] are commonly used decoding heuristics, we found that simple biased sampling from the temperature-adjusted distributions $p^{(T)}(s \mid x) = \prod_i p(s_i \mid x, s_{<i})^{1/T}$