# Generative models for graph-based protein design

John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola
Computer Science and Artificial Intelligence Lab, MIT
{ingraham, vgarg, regina, tommi}@csail.mit.edu

## Abstract

Engineered proteins offer the potential to solve many problems in biomedicine, energy, and materials science, but creating designs that succeed is difficult in practice. A significant aspect of this challenge is the complex coupling between protein sequence and 3D structure, with the task of finding a viable design often referred to as the inverse protein folding problem. In this work, we introduce a conditional generative model for protein sequences given 3D structures based on graph representations. Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. This graph-based approach improves in both speed and reliability over conventional and other neural network-based approaches, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.

## 1 Introduction

A central goal for computational protein design is to automate the invention of protein molecules with defined structural and functional properties. This field has seen tremendous progress in the past two decades [1], including the design of novel 3D folds [2], enzymes [3], and complexes [4]. Despite these successes, current approaches are often unreliable, requiring multiple rounds of trial-and-error in which initial designs often fail [5, 6]. Moreover, diagnosing the origin of this unreliability is difficult, as contemporary bottom-up approaches depend both on the accuracy of complex, composite energy functions for protein physics and on the efficiency of sampling algorithms for jointly exploring protein sequence and structure space.

Here, we explore an alternative, top-down framework for protein design that directly learns a conditional generative model for protein sequences given a specification of the target structure, which is represented as a graph over the residues (amino acids). Specifically, we augment the autoregressive self-attention of recent sequence models [7] with graph-based representations of 3D molecular structure. By composing multiple layers of this structured self-attention, our model can effectively capture higher-order, interaction-based dependencies between sequence and structure, in contrast to previous parametric approaches [8, 9] that are limited to first-order effects.

A graph-structured sequence model offers several benefits, including favorable computational efficiency, inductive bias, and representational flexibility. We accomplish the first two by leveraging a well-evidenced finding in protein science, namely that long-range dependencies in sequence are generally short-range in 3D space [10-12]. By making the graph and self-attention similarly sparse and localized in 3D space, we achieve computational scaling that is linear in sequence length. Additionally, graph-structured inputs offer representational flexibility, as they accommodate both coarse, flexible-backbone descriptions of structure (connectivity and topology) and fine-grained ones (precise atom locations).

We demonstrate the merits of our approach via a detailed empirical study.
Specifically, we evaluate our model's performance for structural generalization to sequences of protein 3D folds that are topologically distinct from those in the training set. For fixed-backbone sequence design, we find that our model achieves considerably improved statistical performance over a prior neural network-based model and that it attains higher accuracy and efficiency than Rosetta fixbb, a state-of-the-art program for protein design.

The rest of the paper is organized as follows. We first position our contributions with respect to prior work in Section 1.1. We provide details on our methods, including structure representation, in Section 2, and introduce our Structured Transformer model in Section 2.2. The details of our experiments are laid out in Section 3, and the corresponding results that elucidate the merits of our approach are presented in Section 4.

### 1.1 Related Work

**Generative models for protein sequence and structure** A number of works have explored the use of generative models for protein engineering and design [13]. [8, 9, 14] used neural network-based models for sequences given 3D structure, where the amino acids are modeled independently of one another. [15] introduced a generative model for protein sequences conditioned on a 1D, context-free grammar based specification of the fold topology. Multiple works [16, 17] have modeled the conditional distribution of single amino acids given surrounding structure and sequence context with convolutional neural networks. In contrast to these works, our model captures the joint distribution of the full protein sequence while grounding these dependencies in terms of long-range interactions arising from structure.

In parallel to the development of structure-based models, there has been considerable work on deep generative models for protein sequences in individual protein families [18-21]. While useful, these methods presume the availability of a large number of sequences from a particular family, which are unavailable in the case of designing novel proteins that diverge significantly from natural sequences. Several groups have obtained promising results using unconditional protein language models [22-25] to learn sequence representations that transfer well to supervised tasks. While serving different purposes, we emphasize that one advantage of conditional generative modeling is to facilitate adaptation to specific (and potentially novel) parts of structure space. Language models trained on hundreds of millions of evolutionary sequences will still be semantically bottlenecked by the thousands of 3D evolutionary folds that those sequences represent. We propose evaluating protein language models with structure-based splitting of sequence data, and begin to see how unconditional language models may struggle to assign high likelihoods to sequences from out-of-training folds. In a complementary line of research, several deep and differentiable parameterizations of protein structure [26-29] have recently been proposed that could be used to generate, optimize, or validate 3D structures for input to sequence design.

**Protein design and interaction graphs** For classical approaches to computational protein design, which are based on joint modeling of structure and sequence, we refer the reader to a review of both methods and accomplishments in [1].
Many of the major firsts in protein design are due to Rosetta [30, 31], a leading framework for protein design. More recently, there have been successes with non-parametric approaches to protein design [32], which are based on finding substructural homologies between the target and diverse templates in a large protein database. In this work, we focus on comparisons to Rosetta (Section 4.2), since it is based on a single parametric energy function for capturing the sequence-structure relationship.

**Self-Attention** Our model extends the Transformer [33] to capture sparse, pairwise relational information between sequence elements. The dense variation of this problem was explored in [34] and [35]. As noted in those works, incorporating general pairwise information incurs $O(N^2)$ memory (and computational) cost for sequences of length $N$, which can be highly limiting for training on GPUs. We circumvent this cost by instead restricting the self-attention to the sparsity of the input graph. Given this graph-structured self-attention, our model may also be reasonably cast in the framework of message-passing or graph neural networks [36, 37] (Section 4.1). Our approach is similar to Graph Attention Networks [38], but augmented with edge features and an autoregressive decoder.

Figure 1: A graph-based, autoregressive model for protein sequences given 3D structures. (A) We cast protein design as language modeling conditioned on an input graph. In our architecture, an encoder develops a sequence-independent representation of 3D structure via multi-head self-attention [7] on the spatial k-nearest neighbors graph. A decoder then autoregressively generates each amino acid $s_i$ given the full structure and previously decoded amino acids. (B) Each layer of the encoder and decoder contains a step of neighborhood aggregation (self-attention) and of local information processing (position-wise feedforward).

In this work, we introduce a Structured Transformer model that draws inspiration from the self-attention based Transformer model [7] and is augmented for scalable incorporation of relational information (Figure 1). While general relational attention incurs quadratic memory and computation costs, we avert these by restricting the attention for each node $i$ to the set $N(i, k)$ of its $k$-nearest neighbors in 3D space. Since our architecture is multilayered, iterated local attention can derive progressively more global estimates of context for each node $i$. Second, unlike the standard Transformer, we also include edge features to embed the spatial and positional dependencies in deriving the attention. Thus, our model generalizes the Transformer to spatially structured settings.

### 2.1 Representing structure as a graph

We represent protein structure in terms of an attributed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node features $\mathcal{V} = \{v_1, \ldots, v_N\}$ describing each residue (amino acid; the letters that compose a protein sequence) and edge features $\mathcal{E} = \{e_{ij}\}_{i \neq j}$ capturing relationships between them.
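To make this construction concrete, the following is a minimal sketch, not the authors' released code, of building such a k-nearest-neighbors residue graph from per-residue coordinates with NumPy and SciPy; the choice of a single Cα-like representative point per residue and the helper names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of building a
# k-nearest-neighbors residue graph G = (V, E) from per-residue coordinates.
import numpy as np
from scipy.spatial import cKDTree

def knn_residue_graph(coords: np.ndarray, k: int = 30):
    """coords: (N, 3) array with one representative coordinate per residue.

    Returns an (N, k) array of neighbor indices and an (N, k) array of
    pairwise distances, defining the sparse edge set {e_ij : j in N(i, k)}.
    """
    tree = cKDTree(coords)
    # Query k+1 neighbors because the nearest neighbor of each point is itself.
    dists, idx = tree.query(coords, k=k + 1)
    return idx[:, 1:], dists[:, 1:]

if __name__ == "__main__":
    N = 100  # hypothetical chain length
    coords = np.cumsum(np.random.randn(N, 3), axis=0)  # toy "backbone" trace
    neighbors, distances = knn_residue_graph(coords, k=30)
    print(neighbors.shape, distances.shape)  # (100, 30) (100, 30)
```

Restricting all downstream computation to these k neighbor lists is what keeps the cost linear in sequence length.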
This formulation can accommodate different variations on the macromolecular design problem, including both rigid backbone design, where the precise coordinates of backbone atoms are fixed, and flexible backbone design, where softer constraints such as blueprints of hydrogen-bonding connectivity [5] or 1D architectures [15] could define the structure of interest.

**3D considerations** For a rigid-body design problem, the structure for conditioning is a fixed set of backbone coordinates $X = \{x_i \in \mathbb{R}^3 : 1 \le i \le N\}$, where $N$ is the number of positions (here we consider a single representative coordinate per position when deriving edge features, but may revisit multiple atom types per position for features such as backbone angles or hydrogen bonds). We desire a graph representation of the coordinates $G(X)$ that has two properties:

- **Invariance.** The features are invariant to rotations and translations.
- **Locally informative.** The edge features incident to $v_i$ due to its neighbors $N(i, k)$, i.e. $\{e_{ij}\}_{j \in N(i,k)}$, contain sufficient information to reconstruct all adjacent coordinates $\{x_j\}_{j \in N(i,k)}$ up to rigid-body motion.

While invariance is motivated by standard symmetry considerations, the second property is motivated by limitations of current graph neural networks [36]. In these networks, updates to node features $v_i$ depend only on the edge and node features adjacent to $v_i$. However, these features are typically insufficient to reconstruct the relative neighborhood positions $\{x_j\}_{j \in N(i,k)}$, so individual updates cannot fully depend on the "local environment". For example, when reasoning about the neighborhood around coordinate $x_i$, the pairwise distances $D_{ia}$ and $D_{ib}$ are insufficient to determine whether $x_a$ and $x_b$ lie on the same or opposite sides.

Figure 2: Spatial features capture structural relationships across diverse folds. (A) The edge features of our most detailed protein graph representation capture the relative distance, direction, and orientation between two positions on the backbone. For scalability, all computation after an initially dense Euclidean distance calculation (right, top), such as relative directions (right, bottom) and neural steps, can be restricted to the k-nearest neighbors graph. (B) Example of topological variation in the dataset. Protein chains in train, test, and validation are split by the sub-chain CATH [40] topologies, which means that folds in each set will have distinct patterns of spatial connectivity.

**Relative spatial encodings** We develop invariant and locally informative features by first augmenting the points $x_i$ with orientations $O_i$ that define a local coordinate system at each point (Figure 2). We define these in terms of the backbone geometry as
$$O_i = [\, b_i \;\; n_i \;\; b_i \times n_i \,],$$
where $b_i$ is the negative bisector of the angle between the rays $(x_{i-1} - x_i)$ and $(x_{i+1} - x_i)$, and $n_i$ is a unit vector normal to that plane. Formally,
$$u_i = \frac{x_i - x_{i-1}}{\lVert x_i - x_{i-1} \rVert}, \qquad b_i = \frac{u_i - u_{i+1}}{\lVert u_i - u_{i+1} \rVert}, \qquad n_i = \frac{u_i \times u_{i+1}}{\lVert u_i \times u_{i+1} \rVert}.$$
Finally, we derive the spatial edge features $e_{ij}^{(s)}$ from the rigid-body transformation that relates reference frame $(x_i, O_i)$ to reference frame $(x_j, O_j)$. While this transformation has 6 degrees of freedom, we decompose it into features for distance, direction, and orientation as
$$e_{ij}^{(s)} = \left( r\!\left(\lVert x_j - x_i \rVert\right),\; O_i^{\top} \frac{x_j - x_i}{\lVert x_j - x_i \rVert},\; q\!\left(O_i^{\top} O_j\right) \right).$$
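As a concrete reference, here is a minimal NumPy sketch of how such frames and raw edge features could be computed from the formulas above; the function names, the padding of chain termini, and the omission of the radial basis and quaternion encodings are illustrative simplifications, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's released code) of the local
# frames O_i = [b_i, n_i, b_i x n_i] and the raw spatial edge features.
import numpy as np

def _unit(v, eps=1e-8):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def local_frames(x):
    """x: (N, 3) representative backbone coordinates. Returns (N, 3, 3) frames;
    terminal frames are copied from the nearest interior position for convenience."""
    u = _unit(x[1:] - x[:-1])          # u_i = (x_i - x_{i-1}) / ||x_i - x_{i-1}||
    u_i, u_next = u[:-1], u[1:]        # aligned at interior positions
    b = _unit(u_i - u_next)            # negative bisector
    n = _unit(np.cross(u_i, u_next))   # normal to the bisecting plane
    O = np.stack([b, n, np.cross(b, n)], axis=-1)       # columns b, n, b x n
    return np.concatenate([O[:1], O, O[-1:]], axis=0)   # pad chain termini

def spatial_edge_features(x, O, i, j):
    """Distance, direction in frame i, and relative rotation O_i^T O_j.
    The paper additionally lifts the distance into a radial basis r(.) and
    encodes the rotation as a quaternion q(.); both steps are omitted here."""
    d = x[j] - x[i]
    dist = np.linalg.norm(d)
    direction = O[i].T @ (d / dist)
    rotation = O[i].T @ O[j]           # would be converted to a quaternion
    return dist, direction, rotation
```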
In $e_{ij}^{(s)}$, the first term is a distance encoding $r(\cdot)$ lifted into a radial basis (we used 16 Gaussian RBFs isotropically spaced from 0 to 20 Angstroms), the second is a direction encoding corresponding to the relative direction of $x_j$ in the reference frame $(x_i, O_i)$, and the third is an orientation encoding $q(\cdot)$ of the quaternion representation of the spatial rotation matrix $O_i^{\top} O_j$. Quaternions represent 3D rotations as four-element vectors that can be efficiently and reasonably compared by inner products [39]; we represent them in terms of their vector of real coefficients.

**Relative positional encodings** As in the original Transformer, we also represent distances between residues in the sequence (rather than in space) with positional embeddings $e_{ij}^{(p)}$. Specifically, we need to represent the positioning of each neighbor $j$ relative to the node under consideration $i$. We therefore obtain the position embedding as a sinusoidal function of the gap $i - j$. We retain the sign of $i - j$ because protein sequences are generally asymmetric. These relative encodings contrast with the absolute encodings of the original Transformer, but are consistent with modifications described in subsequent work [34].

**Node and edge features** Finally, we obtain an aggregate edge encoding vector $e_{ij}$ by concatenating the structural encodings $e_{ij}^{(s)}$ with the positional encodings $e_{ij}^{(p)}$ and then linearly transforming them to the same dimension as the model. We only include edges in the k-nearest neighbors graph of $X$, with $k = 30$ for all experiments. This $k$ is generous, as typical definitions of residue-residue contacts in proteins result in fewer than 20 contacts per residue. For node features, we compute the three dihedral angles of the protein backbone $(\phi_i, \psi_i, \omega_i)$ and embed these on the 3-torus as $\{\sin, \cos\} \times (\phi_i, \psi_i, \omega_i)$.

**Flexible backbone features** We also consider flexible backbone descriptions of 3D structure based on topological binary edge features and coarse backbone geometry. We combine the relative positional encodings with two binary edge features: contacts, which indicate when the distance between the Cα atoms at positions $i$ and $j$ is less than 8 Angstroms, and hydrogen bonds, which are directed and defined by the electrostatic model of DSSP [41]. For coarse node features, we compute virtual dihedral and bond angles between backbone Cα atoms, interpret them as spherical coordinates, and represent them as points on the unit sphere.

### 2.2 Structured Transformer

**Autoregressive decomposition** We decompose the distribution of a protein sequence $s$ given a 3D structure $x$ autoregressively as
$$p(s \mid x) = \prod_i p(s_i \mid x, s_{<i}).$$
In the decoder, the information that position $i$ attends to at node $j$ is the concatenation
$$\begin{cases} \left(h_j^{(\mathrm{dec})},\, e_{ij},\, g(s_j)\right) & j < i \\ \left(h_j^{(\mathrm{enc})},\, e_{ij},\, \mathbf{0}\right) & j \ge i. \end{cases}$$
Here $h_j^{(\mathrm{dec})}$ is the embedding of node $j$ in the current layer of the decoder, $h_j^{(\mathrm{enc})}$ is the embedding of node $j$ in the final layer of the encoder, and $g(s_j)$ is a sequence embedding of amino acid $s_j$ at node $j$. This concatenation and masking structure ensures that sequence information only flows to position $i$ from positions $j < i$, but still allows position $i$ to attend to subsequent structural information, unlike the standard Transformer decoder.

We now demonstrate the merits of our approach via a detailed empirical analysis. We begin with the experimental setup, including our architecture and a description of the data used in our experiments.

**Architecture** In all experiments, we used three layers of self-attention and position-wise feedforward modules for the encoder and decoder, with a hidden dimension of 128.
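To illustrate the shape of one such layer (neighborhood self-attention over the k-nearest-neighbors graph followed by a position-wise feedforward step), here is a single-head NumPy sketch; the single head, the omission of layer normalization and dropout, and the random toy weights are simplifications, and none of this is the authors' implementation.

```python
# Minimal sketch of one structured self-attention layer: each node i attends
# only to its k nearest neighbors, with edge features concatenated into keys
# and values, followed by a position-wise feedforward step.
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def structured_attention_layer(h, e, nbr, Wq, Wk, Wv, W1, W2):
    """h: (N, d) node embeddings, e: (N, k, d) edge embeddings,
    nbr: (N, k) neighbor indices, W*: weight matrices (biases omitted)."""
    h_nbr = h[nbr]                                  # (N, k, d) neighbor embeddings
    rel = np.concatenate([h_nbr, e], axis=-1)       # (N, k, 2d) relational information
    q = h @ Wq                                      # (N, d) query per node
    K = rel @ Wk                                    # (N, k, d) key per neighbor
    V = rel @ Wv                                    # (N, k, d) value per neighbor
    att = softmax(np.einsum('nd,nkd->nk', q, K) / np.sqrt(q.shape[-1]))
    h = h + np.einsum('nk,nkd->nd', att, V)         # neighborhood aggregation
    h = h + np.maximum(h @ W1, 0.0) @ W2            # position-wise feedforward (ReLU)
    return h

# Toy usage with random weights (d = 128, k = 30 as in the paper).
rng = np.random.default_rng(0)
N, k, d = 50, 30, 128
h = rng.normal(size=(N, d)); e = rng.normal(size=(N, k, d))
nbr = np.stack([rng.choice(N, size=k, replace=False) for _ in range(N)])
Wq = rng.normal(size=(d, d)) * 0.05; Wk = rng.normal(size=(2 * d, d)) * 0.05
Wv = rng.normal(size=(2 * d, d)) * 0.05
W1 = rng.normal(size=(d, 4 * d)) * 0.05; W2 = rng.normal(size=(4 * d, d)) * 0.05
print(structured_attention_layer(h, e, nbr, Wq, Wk, Wv, W1, W2).shape)  # (50, 128)
```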
**Optimization** We trained models using the learning rate schedule and initialization of the original Transformer paper [7], a dropout rate of 10% [42], a label smoothing rate of 10%, and early stopping based on validation perplexity. The unconditional language models did not include dropout or label smoothing.

**Dataset** To evaluate the ability of our models to generalize across different protein folds, we collected a dataset based on the CATH hierarchical classification of protein structure [40]. For all domains in the CATH 4.2 40% non-redundant set of proteins, we obtained full chains up to length 500 and then randomly assigned their CATH topology classifications (CAT codes) to train, validation, and test sets at a targeted 80/10/10 split. Since each chain can contain multiple CAT codes, we first removed any redundant entries from train and then from validation. Finally, we removed any chains from the test set that had CAT overlap with train and removed chains from the validation set with CAT overlap to train or test. This resulted in a dataset of 18,024 chains in the training set, 608 chains in the validation set, and 1,120 chains in the test set. There is zero CAT overlap between these sets.

A challenge in evaluating computational protein design methods is the degeneracy of the relationship between protein structure and sequence. Many protein sequences may reasonably design the same 3D structure [43], meaning that sequence similarity need not necessarily be high. At the same time, single mutations may cause a protein to break or misfold, meaning that high sequence similarity isn't sufficient for a correct design. To deal with this, we focus on three kinds of evaluation: (i) likelihood-based, where we test the ability of the generative model to assign high likelihood to held-out sequences; (ii) native sequence recovery, where we evaluate generated sequences against the native sequences of templates; and (iii) experimental comparison, where we compare the likelihoods of the model to high-throughput data from a de novo protein design experiment. We find that our model is able to attain considerably improved statistical performance in its likelihoods while simultaneously providing more accurate and efficient sequence recovery.

Table 2: Per-residue perplexities for protein language modeling (lower is better). The protein chains have been cluster-split by CATH topology, such that the test set includes only unseen 3D folds. While a structure-conditioned language model can generalize in this structure-split setting, unconditional language models struggle.

| Test set | Short | Single chain | All |
|---|---|---|---|
| **Structure-conditioned models** | | | |
| Structured Transformer (ours) | 8.54 | 9.03 | 6.85 |
| SPIN2 [8] | 12.11 | 12.61 | - |
| **Language models** | | | |
| LSTM (h = 128) | 16.06 | 16.38 | 17.13 |
| LSTM (h = 256) | 16.08 | 16.37 | 17.12 |
| LSTM (h = 512) | 15.98 | 16.38 | 17.13 |
| Test set size | 94 | 103 | 1120 |

### 4.1 Statistical comparison to likelihood-based models

**Protein perplexities** What kind of perplexities might be useful? To provide context, we first present perplexities for some simple models of protein sequences in Table 1. The amino acid alphabet and its natural frequencies upper-bound perplexity at 20 and 17.8, respectively. Random protein sequences under these null models are unlikely to be functional without further selection [44]. First-order profiles of protein sequences, such as those from the Pfam database [45], are nevertheless widely used for protein engineering. We found the average perplexity per letter of profiles in Pfam 32 (ignoring alignment uncertainty) to be 11.6.
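For reference, per-residue perplexity is the exponential of the average negative log-likelihood per position, so a uniform distribution over the 20 amino acids gives exactly 20 and the natural amino-acid frequencies give roughly 17.8, matching the null-model bounds above. A minimal sketch of the computation:

```python
# Minimal sketch of per-residue perplexity: exp of the mean negative
# log-likelihood that a model assigns to the true amino acid at each position.
import numpy as np

def perplexity(log_probs: np.ndarray) -> float:
    """log_probs: 1-D array of log p(s_i | context) for every residue
    in the evaluation set (natural log)."""
    return float(np.exp(-np.mean(log_probs)))

# A uniform model over the 20-letter amino acid alphabet has perplexity 20.
uniform_log_probs = np.full(1000, np.log(1.0 / 20.0))
print(perplexity(uniform_log_probs))  # 20.0
```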
The Pfam figure suggests that even models with perplexities as high as 11 have the potential to be useful for modeling the space of functional protein sequences.

**The importance of structure** We found a significant gap between unconditional language models of protein sequences and models conditioned on structure. Remarkably, for a range of structure-independent language models, the typical test perplexities are 16-17 (Table 2), barely better than the null letter frequencies (Table 1). We emphasize that the RNNs were not broken and could still learn the training set in these capacity ranges. All structure-based models had (unsurprisingly) considerably lower perplexities. In particular, our Structured Transformer model attained a perplexity of 7 on the full test set. It seems that protein language models trained on one subset of 3D folds (in our cluster-splitting procedure) generalize poorly to predict the sequences of unseen folds. We believe this possibility might be important to consider when training protein language models for protein engineering and design.

Table 3: Ablation of graph features and model components. Test perplexities (lower is better).

| Node features | Edge features | Aggregation | Short | Single chain | All |
|---|---|---|---|---|---|
| *Rigid backbone* | | | | | |
| Dihedrals | Distances, Orientations | Attention | 8.54 | 9.03 | 6.85 |
| Dihedrals | Distances, Orientations | Pair MLP | 8.33 | 8.86 | 6.55 |
| Cα angles | Distances, Orientations | Attention | 9.16 | 9.37 | 7.83 |
| Dihedrals | Distances | Attention | 9.11 | 9.63 | 7.87 |
| *Flexible backbone* | | | | | |
| Cα angles | Contacts, Hydrogen bonds | Attention | 11.71 | 11.81 | 11.51 |

Table 4: Improved reliability and speed compared to Rosetta. (a) On the single chain test set, our model more accurately recovers native sequences than Rosetta fixbb with greater speed (CPU: single core of Intel Xeon Gold 5115; GPU: NVIDIA RTX 2080). This set includes NMR-based structures for which Rosetta is known to not be robust [46]. (b) Our model also performs favorably on a prior benchmark of 40 proteins. All results are reported as the median of the average over 100 designs.

(a) Single chain test set (103 proteins)

| Method | Recovery (%) | Speed (AA/s), CPU | Speed (AA/s), GPU |
|---|---|---|---|
| Rosetta 3.10 fixbb | 17.9 | $4.88 \times 10^{-1}$ | N/A |
| Ours (T = 0.1) | 27.6 | $2.22 \times 10^{2}$ | $1.04 \times 10^{4}$ |

(b) Ollikainen benchmark (40 proteins)

| Method | Recovery (%) |
|---|---|
| Rosetta, fixbb 1 | 33.1 |
| Rosetta, fixbb 2 | 38.4 |
| Ours (T = 0.1) | 39.2 |

**Improvement over deep profile-based methods** We also compared to SPIN2, a recent method that uses deep neural networks to predict protein sequence profiles given protein structures [8]. Since SPIN2 is computationally intensive (minutes per protein, even for small proteins) and was trained on complete proteins rather than chains, we evaluated it on two subsets of the full test set: a Short subset containing chains up to length 100, and a Single chain subset containing only those entries where a single chain accounted for the entire protein record in the Protein Data Bank. Both subsets discarded any chains with structural gaps (chain breaks). We found that our Structured Transformer model significantly improved upon the perplexities of SPIN2 (Table 2).

**Graph representations and attention mechanisms** The graph-based formulation of protein design can accommodate very different formulations of the problem, depending on how structure is represented by a graph. We tested different approaches for representing the protein, including both more rigid design with precise geometric details and flexible, topological design based on spatial contacts and hydrogen bonding (Table 3).
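To make the coarsest of these representations concrete, the contact-based binary edge feature described in Section 2.1 (Cα-Cα distance below 8 Angstroms) can be sketched as follows; the hydrogen-bond feature would require a DSSP-style electrostatic calculation and is omitted, and the function below is an illustrative assumption rather than the authors' code.

```python
# Minimal sketch of the binary "contact" edge feature used in the flexible
# backbone representation: 1 if two Calpha atoms are within 8 Angstroms.
import numpy as np

def contact_edges(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """ca_coords: (N, 3) Calpha coordinates. Returns an (N, N) 0/1 contact map."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    contacts = (dist < cutoff).astype(np.int8)
    np.fill_diagonal(contacts, 0)  # ignore self-contacts
    return contacts
```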
For the best perplexities, we found that using local orientation information was indeed important beyond simple distance measures. At the same time, even the topological features were sufficient to obtain better perplexities than SPIN2 (Table 2), which uses precise atomic details. In addition to varying the graph features, we also experimented with an alternative aggregation function from message passing neural networks [36]. We found that a simple aggregation function
$$h_i = \sum_j \mathrm{MLP}\!\left(h_i, h_j, e_{ij}\right)$$
led to the best performance of all models, where $\mathrm{MLP}(\cdot)$ is a two-layer perceptron that preserves the hidden dimension of the model. We speculate that this is due to potential overfitting by the attention mechanism. Although this suggests room for future improvements, we use multi-head self-attention throughout the remaining experiments.

### 4.2 Benchmarking protein redesign

**Decoding strategies** Generating protein sequence designs requires a sampling scheme for drawing high-likelihood sequences from the model. While beam search and top-k sampling [47] are commonly used decoding heuristics, we found that simple biased sampling from the temperature-adjusted distributions $p^{(T)}(s \mid x) = \prod_i p(s_i \mid x, s_{<i})^{1/T}$