# Learning Mutational Semantics

Brian Hie (MIT, brianhie@mit.edu) · Ellen D. Zhong (MIT, zhonge@mit.edu) · Bryan D. Bryson (MIT, bryand@mit.edu) · Bonnie Berger (MIT, bab@mit.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

In many natural domains, changing a small part of an entity can transform its semantics; for example, a single word change can alter the meaning of a sentence, or a single amino acid change can mutate a viral protein to escape antiviral treatment or immunity. Although identifying such mutations can be desirable (for example, therapeutic design that anticipates avenues of viral escape), the rules governing semantic change are often hard to quantify. Here, we introduce the problem of identifying mutations with a large effect on semantics, but where valid mutations are under complex constraints (for example, English grammar or biological viability), which we refer to as constrained semantic change search (CSCS). We propose an unsupervised solution based on language models that simultaneously learn continuous latent representations. We report good empirical performance on CSCS of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution. Code is available at https://github.com/brianhie/mutational-semantics-neurips2020.

## 1 Introduction

Much of the effort devoted to learning machine-intelligible representations of natural language semantics has been built on the distributional hypothesis, in which the context and co-occurrence of words is assumed to provide insight into the meaning of words [25, 22, 35, 38, 41, 43]. While distributional semantics was developed to model human intuitive notions of meaning, similar reasoning may be useful for domains beyond human intuition. For example, like linguistic semantics, biological function is encoded by a sequence of tokens (the bases of nucleic acids or the amino acid residues of proteins) that is determined by a complex distributional structure. Promisingly, recent analyses of biological sequences inspired by tools for modeling natural language have been shown to improve prediction of biological function [9, 45, 5].

A pressing and still poorly understood biological problem is understanding how rapidly mutating viral proteins can evade recognition by escaping the immune system's antibodies. Viral escape, which can be caused by even a single-residue change, has prevented the development of a universal antibody-based vaccine for influenza [30, 33] or human immunodeficiency virus (HIV) [6]. However, the rules governing viral fitness are complex, and a biological experiment that empirically tests the escape potential of all mutations to all viral strains would be prohibitively expensive. A key concept underlying this study is that, in order to escape the immune system, a mutation must not only preserve viral infectivity (i.e., it must be "grammatical") but it must also be functionally altered so that it is no longer recognized by the immune system's antibodies (i.e., it must have substantial "semantic change"). Here, we introduce the problem of searching for sequence mutations based on both high semantic change and grammatical validity, which we call constrained semantic change search (CSCS).
This is in contrast to settings concerned with semantic similarity search, rather than change. To gain intuition, we apply CSCS to natural language and, to demonstrate broader impact, we apply CSCS to predict viral escape. Our key contributions are (1) we introduce the CSCS problem formulation and show how learned language models offer a compelling solution with strong empirical results on both natural language and biological applications, suggesting that the distributional hypothesis from linguistics is also useful for modeling pathogenic evolution; (2) we develop an unsupervised neural language model for viral proteins and show that it learns semantically meaningful embeddings; and (3) we use CSCS for zero-shot prediction of escape mutations for influenza and for HIV with quantitative results much higher than baseline methods. To our knowledge, we present the first computational model that effectively predicts viral escape, potentially enabling vaccine or therapeutic design that anticipates escape before it occurs.

## 2.1 Problem Formulation

Intuitively, our goal is to identify mutations that induce high semantic change (e.g., a large impact on biological function) while being grammatically acceptable (e.g., biologically viable). More precisely, we are given a sequence of tokens defined as $\mathbf{x} \triangleq (x_1, \ldots, x_N)$ such that $x_i \in \mathcal{X}$ for all $i \in [N]$, where $\mathcal{X}$ is a finite alphabet (e.g., characters or words for natural language, or amino acids for protein sequence). Let $\tilde{x}_i$ denote a mutation at position $i$ and denote the mutated sequence as $\mathbf{x}[\tilde{x}_i] \triangleq (\ldots, x_{i-1}, \tilde{x}_i, x_{i+1}, \ldots)$.

We first require a semantic embedding $\mathbf{z} \triangleq f_s(\mathbf{x})$, where $f_s : \mathcal{X}^N \to \mathbb{R}^K$ embeds discrete-alphabet sequences into a continuous space in which, ideally, closeness in embedding space corresponds to semantic similarity. We define semantic change as the distance in embedding space, i.e.,

$$\Delta \mathbf{z}[\tilde{x}_i] \triangleq \|\mathbf{z} - \mathbf{z}[\tilde{x}_i]\| = \|f_s(\mathbf{x}) - f_s(\mathbf{x}[\tilde{x}_i])\| \quad (1)$$

where $\|\cdot\|$ denotes a vector norm. The grammaticality of a mutation is described by

$$p(\tilde{x}_i \mid \mathbf{x}), \quad (2)$$

which takes values close to zero if $\mathbf{x}[\tilde{x}_i]$ is not grammatical and close to one if it is grammatical. Our objective combines semantic change and grammaticality as a linear combination

$$a(\tilde{x}_i; \mathbf{x}) \triangleq \Delta \mathbf{z}[\tilde{x}_i] + \beta\, p(\tilde{x}_i \mid \mathbf{x})$$

for each possible mutation $\tilde{x}_i$ and a user-specified parameter $\beta \in [0, \infty)$. Mutations $\tilde{x}_i$ are prioritized based on $a(\tilde{x}_i; \mathbf{x})$. We refer to ranking mutations based on semantic change and grammaticality as CSCS.
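To make the acquisition function concrete, the following is a minimal sketch (not the authors' released implementation) of scoring every single-token mutation under the objective $a(\tilde{x}_i; \mathbf{x}) = \Delta\mathbf{z}[\tilde{x}_i] + \beta\, p(\tilde{x}_i \mid \mathbf{x})$. The helpers `embed_fn` (standing in for $f_s$) and `prob_fn` (standing in for the model's token probability) are hypothetical placeholders for whatever model is used:

```python
import numpy as np

def cscs_scores(seq, alphabet, embed_fn, prob_fn, beta=1.0):
    """Score every single-token mutation of `seq` by the CSCS objective.

    embed_fn(seq) -> 1-D numpy array, the semantic embedding f_s(seq).
    prob_fn(token, i, seq) -> probability of placing `token` at position i
        given the rest of `seq` (the grammaticality term).
    Returns [((position, token), score), ...] sorted from highest to lowest.
    """
    z = embed_fn(seq)
    scored = []
    for i in range(len(seq)):
        for token in alphabet:
            if token == seq[i]:
                continue  # only consider actual mutations
            mutant = seq[:i] + [token] + seq[i + 1:]
            # Semantic change: distance in embedding space (l1 norm, as in Section 2.2.1).
            delta_z = np.linalg.norm(embed_fn(mutant) - z, ord=1)
            grammaticality = prob_fn(token, i, seq)  # p(x~_i | x)
            scored.append(((i, token), delta_z + beta * grammaticality))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```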
## 2.2 Algorithms

### 2.2.1 Language Modeling

Algorithms for CSCS could potentially take many forms; for example, separate algorithms could be used to compute $\Delta \mathbf{z}[\tilde{x}_i]$ and $p(\tilde{x}_i \mid \mathbf{x})$ independently, or a two-step approach might be possible that computes one of the terms based on the value of the other. Instead, we reasoned that a single approach could compute both terms simultaneously, based on learned language models that learn the probability distribution of a word given its context [38, 15, 43, 16, 44]. The language model we use throughout our experiments considers the full sequence context of a word and learns a latent variable probability distribution $\hat{p}$ and function $\hat{f}_s$, where, for all $i \in [N]$,

$$\hat{p}(x_i \mid \mathbf{x}_{[N] \setminus \{i\}}, \hat{\mathbf{z}}_i) = \hat{p}(x_i \mid \hat{\mathbf{z}}_i) \quad \text{and} \quad \hat{\mathbf{z}}_i = \hat{f}_s(\mathbf{x}_{[N] \setminus \{i\}}),$$

i.e., latent variable $\hat{\mathbf{z}}_i$ encodes the context $\mathbf{x}_{[N] \setminus \{i\}} \triangleq (\ldots, x_{i-1}, x_{i+1}, \ldots)$ such that $x_i$ is conditionally independent of its context given the value of $\hat{\mathbf{z}}_i$.

Figure 1: Constrained semantic change search (CSCS) for viral escape prediction. Left: Given an input sequence $\mathbf{x}$ and its semantics encoded by $\mathbf{z}$, CSCS aims to find a mutation to $\mathbf{x}$ that causes the largest semantic change (high $\Delta \mathbf{z}$) while remaining grammatical (high $\hat{p}(\mathbf{x})$). Right: Language model architecture with two stacked BiLSTM layers instantiating the semantic embedding function $\hat{f}_s$, with the final language model output used as grammaticality.

We use different aspects of the language model to describe semantic change and grammaticality by setting terms (1) and (2) as

$$\Delta \mathbf{z}[\tilde{x}_i] \triangleq \|\hat{\mathbf{z}} - \hat{\mathbf{z}}[\tilde{x}_i]\|_1 \quad \text{and} \quad p(\tilde{x}_i \mid \mathbf{x}) \triangleq \hat{p}(\tilde{x}_i \mid \hat{\mathbf{z}}_i),$$

where $\hat{\mathbf{z}} \triangleq [\hat{\mathbf{z}}_1^T \cdots \hat{\mathbf{z}}_N^T]^T$ is the concatenation of embeddings for each token, $\hat{\mathbf{z}}[\tilde{x}_i]$ is defined similarly but for the mutated sequence, and $\|\cdot\|_1$ is the $\ell_1$ norm, chosen because of more favorable properties compared to other standard distance metrics, though other metrics could be empirically evaluated in future work [2]. Effectively, distances in embedding space are used to approximate semantic change and the emitted probability approximates grammaticality. We note that these modeling assumptions are not guaranteed to be perfectly specified, since, in the natural language setting for example, antonyms may also be close in embedding space and the language model output can also encode linguistic pragmatics in addition to grammaticality. However, we still find these modeling assumptions to have good empirical support.

Training or parameterizing the language model is separate from CSCS, and the novelty of CSCS is in leveraging these models in a new way. An advantage of this approach is that it does not require any bespoke modifications to the general language modeling framework, other than requiring a continuous latent variable. CSCS can therefore leverage the noted multitask generality of language models [44]. Importantly, this approach to CSCS is completely unsupervised. Rather than assume access to labels explicitly encoding semantics or grammaticality, the model instead extracts this information from a large unlabeled corpus. This is critical in domains, like viral genomics, in which large sequence corpuses are available but functional profiling is limited. These corpuses implicitly contain information related to grammaticality or infectivity (e.g., all sequences are grammatically acceptable or come from infectious virus), but the algorithm must learn these rules from data.

### 2.2.2 Architecture

Based on the success of recurrent architectures for protein-sequence representation learning [9, 45, 5], we use similar encoder models for viral protein sequences (Figure 1). Our model passes the full context sequence into bidirectional long short-term memory (BiLSTM) hidden layers. We used the concatenated output of the final LSTM layers as the semantic embedding, i.e.,

$$\hat{\mathbf{z}}_i \triangleq [\,\mathrm{LSTM}_f(g_f(x_1, \ldots, x_{i-1}))^T \;\; \mathrm{LSTM}_r(g_r(x_{i+1}, \ldots, x_N))^T\,]^T,$$

where $g_f$ is the output of the preceding forward-directed layer, $\mathrm{LSTM}_f$ is the final forward-directed LSTM layer, and $g_r$ and $\mathrm{LSTM}_r$ are the corresponding reverse-directed components. The final output probability is a softmax-transformed linear transformation of $\hat{\mathbf{z}}_i$, i.e.,

$$\hat{p}(x_i \mid \mathbf{x}_{[N] \setminus \{i\}}) \triangleq \mathrm{softmax}(\mathbf{W} \hat{\mathbf{z}}_i + \mathbf{b})$$

for some learned model parameters $\mathbf{W}$ and $\mathbf{b}$. In our experiments, we used a 20-dimensional dense embedding for each element in the alphabet $\mathcal{X}$, two BiLSTM layers with 512 units, and categorical cross-entropy loss optimized by Adam with a learning rate of 0.001, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. Additional details on hyperparameter selection are given in Appendix 6.3.1.
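For illustration, the sketch below builds a stacked BiLSTM language model with the hyperparameters listed above using Keras. It is a simplified stand-in rather than the authors' released code; in particular, a plain stacked BiLSTM lets position $i$ see its own token, so shifted forward and reverse passes (or masking) would be needed to exactly reproduce the exclude-self context described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm_lm(vocab_size, embed_dim=20, hidden_dim=512):
    """Simplified stacked BiLSTM language model over a discrete alphabet."""
    tokens = layers.Input(shape=(None,), dtype="int32")   # variable-length token sequences
    x = layers.Embedding(vocab_size, embed_dim)(tokens)   # 20-dimensional token embeddings
    x = layers.Bidirectional(layers.LSTM(hidden_dim, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(hidden_dim, return_sequences=True))(x)
    # The per-position hidden state plays the role of z_i; a linear map plus
    # softmax over the alphabet gives p(x_i | context).
    probs = layers.Dense(vocab_size, activation="softmax")(x)
    model = models.Model(tokens, probs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
        loss="sparse_categorical_crossentropy",
    )
    return model
```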
### 2.2.3 Rank-Based Acquisition

Rather than acquiring mutations based on raw semantic change and grammaticality values, which may be on very different scales, we find that selecting $\beta$ is much easier in practice when first rank-transforming the semantic change and grammaticality terms, i.e., acquiring based on

$$a'(\tilde{x}_i; \mathbf{x}) \triangleq \mathrm{rank}(\Delta \mathbf{z}[\tilde{x}_i]) + \beta\, \mathrm{rank}(p(\tilde{x}_i \mid \mathbf{x})).$$

All possible mutations $\tilde{x}_i$ are then prioritized based on the corresponding values of $a'(\tilde{x}_i; \mathbf{x})$, from highest to lowest. Our empirical results show consistently good performance by simply setting $\beta = 1$ (equally weighting both terms), which we used in all experiments below unless otherwise noted. In this study, we deal with the unsupervised setting where $\beta$ is a parameter, but note that adding some supervision could learn $\beta$ (or other, non-rank, transformations) from data.
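As a small illustration of the rank transformation (under the same hypothetical scoring setup as the sketch above), both terms can be converted to ranks before being combined:

```python
import numpy as np
from scipy.stats import rankdata

def rank_acquisition(semantic_change, grammaticality, beta=1.0):
    """Combine rank-transformed terms: a'(x~_i; x) = rank(dz) + beta * rank(p).

    rankdata assigns rank 1 to the smallest value, so large semantic change
    and high grammaticality both receive large ranks; higher combined values
    are acquired first."""
    semantic_change = np.asarray(semantic_change, dtype=float)
    grammaticality = np.asarray(grammaticality, dtype=float)
    return rankdata(semantic_change) + beta * rankdata(grammaticality)

# Example: order three candidate mutations from most to least promising.
scores = rank_acquisition([0.3, 2.1, 0.9], [0.2, 0.9, 0.5], beta=1.0)
priority = np.argsort(-scores)  # -> array([1, 2, 0]); mutation 1 is acquired first
```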
### 2.2.4 Connection to Viral Escape

A language model is a probability distribution over sequences learned from a corpus of data. For any sequence $\mathbf{x}$, the model will output a predicted probability $p(\mathbf{x})$ of observing that sequence in the training data distribution. We call $p(\mathbf{x})$ grammaticality because in natural language tasks, $p(\mathbf{x})$ tends to be high for grammatically correct sentences. In the case of viral sequences, the training distribution consists of viral proteins that have evolved for high fitness/virality, so we hypothesize that high grammaticality corresponds to high viral fitness. However, high fitness alone does not indicate an escape mutation. For example, a viral protein with a neutral mutation will have equally high fitness but may not look different enough to escape detection by the immune system, i.e., it will have no antigenic change. To identify mutations that do lead to large antigenic changes, we exploit the internal sequence embeddings learned by the language model. If two sequences have similar embeddings, then they have similar distributions over sequence continuations given the input tokens. As a natural-language example, "the men advance," "the soldiers advance," and "the three advance" have a similar set of possible word continuations and would have similar embeddings, while "the cash advance" has a nearly disjoint set of continuations and thus a different embedding. We hypothesize that neutral mutations should not affect the distribution over amino acids at other positions, while mutations that affect antigenicity do affect the distribution over other positions. Thus, the combination of high sequence probability (high fitness) and a large change in embedding (antigenic change) indicates an escape mutation.

## 3 Related Work

The CSCS problem is related to work focused on identifying the best interventions to structured data to produce a desired outcome [40, 42]. Such work often assumes a dataset that includes both the observed features and corresponding outcomes, which allows for supervised learning. In contrast, we assume no explicit labels of semantic change and must resort to unsupervised learning to extract this information. This is because in domains like viral mutation, data that directly measures viral fitness is very limited, while unlabeled sequence data is abundant. Importantly, our CSCS task is distinct from representation learning tasks that construct semantically meaningful embeddings, but CSCS does stand to benefit from innovation in representation learning. Using hidden states in a language model to represent natural language semantics has been an influential and productive idea [43]. Rather than acquiring mutations based on greatest semantic change as in CSCS, acquisition based instead only on the lowest $\Delta \mathbf{z}[\tilde{x}_i]$ essentially performs semantic similarity search among all sequences that differ by a single token. In biological applications, neural language models have been developed to learn unsupervised or weakly supervised protein sequence embeddings that encode generic protein similarity [9, 45, 5]. To our knowledge, however, no previous work has considered how mutations affect these embeddings, nor have such methods been applied to evolutionary change. Furthermore, while many variants of recurrent or transformer-based architectures have been proposed for protein sequence modeling tasks, we note that any such current or future language model architecture could be used in CSCS.

Figure 2: Example CSCS-proposed mutations to news headlines show large changes to the headline meaning or to the syntactic part-of-speech structure. Original: "australian dead in bali"; CSCS: "australian ballet in bali." Original: "winegrowers revel in good season"; CSCS: "winegrowers revel in flu season." Original: "nauru bans transhipments to tackle overfishing"; CSCS: "nauru bans continue to tackle overfishing."

Some work in computational biology has focused on identifying deleterious mutations in human or mammalian genomes with clinical relevance [51, 46]. However, these approaches are based on direct supervision under the assumption that rare or poorly conserved mutations are deleterious. Such an assumption, however, does not apply to escape mutations, which could be either frequent or infrequent in a population. Viral genomes are also more highly variable than mammalian genomes (e.g., Drake's rule), so aligning mutations across viral strains is more difficult [20, 14, 48]. Most computational analyses specific to viral mutation require rich metadata beyond raw sequence or make virus-specific assumptions [8, 54] (for example, vaccine-related temporal patterns in influenza, which are absent for HIV). Most similar to our approach, models exist for learning viral fitness from a large sequence corpus [27, 26]. These approaches, however, require time-consuming and error-prone multiple sequence alignment (MSA) preprocessing [29] and only consider pairwise information couplings among residues, which, as demonstrated below, limits performance when predicting escape. To our knowledge, our work is the first to effectively model viral escape in a way that generalizes to any relevant genomic sequence from diverse viruses, without the need for sequence alignment, complex metadata, or special assumptions on mutational processes.

To demonstrate how CSCS can alter semantics while preserving grammaticality, we gain intuition by first applying CSCS in a natural language setting before demonstrating broader impact by applying CSCS to biological sequence mutation in viruses. We find that CSCS-mutated headlines are semantically altered (quantified via changes in part-of-speech (POS) structure and distance in the WordNet hierarchy) while remaining grammatical. Using a language model trained on a large corpus of influenza sequences, we find that CSCS-mutated viral sequences are predictive of escape mutations (i.e., grammatical mutations that preserve biological viability and infectivity but that also alter the protein's semantics, thereby enabling escape from vaccines or treatments) that were identified by independent biological experiments. To assess generality, we perform this zero-shot escape prediction in two different influenza subtypes and in HIV.
## 4.1 News Headlines

**Setup and Training Data.** We sought to confirm our intuitions of semantic change and grammaticality by applying CSCS to single-word changes in news headlines. Our training corpus consisted of 1,186,018 headlines from the Australian Broadcasting Corporation from 2003 through 2019 (Appendix 6.1.1) [34].

**Language Model Selection.** We selected our model architecture by holding out a test set of headlines from 2016 onward (179,887 headlines, about 15%) and evaluating cross-entropy loss on the language modeling task. We used a cross-validation strategy within the training set to grid search hyperparameters (Appendix 6.3.1). Our BiLSTM model with access to the full context (described above) obtained a training and test loss of 2.2 and 6.0, respectively. Performance decreased when replacing the LSTM hidden layers with densely connected layers (train loss = 2.3, test loss = 7.2) or when removing access to the right context, i.e., a language model task $p(x_i \mid \mathbf{x}_{[i-1]})$ (train loss = 4.2, test loss = 6.5).

Table 1: Headline Semantic Change Results

| Setting | Median % POS Change (NLTK) | Median % POS Change (FLAIR) | Median WordNet Similarity (Pathwise) | Median WordNet Similarity (Wu-Palmer) |
|---|---|---|---|---|
| Semantically closest (smallest $\Delta \mathbf{z}[\tilde{x}_i]$) | 0.00% | 0.00% | 0.143 | 0.546 |
| CSCS-proposed (highest $a'(\tilde{x}_i; \mathbf{x})$) | 16.7% | 14.3% | 0.0833 | 0.235 |
| Two-sided t-test $P$ | $< 10^{-308}$ | $< 10^{-308}$ | $< 10^{-308}$ | $< 10^{-308}$ |

Table 2: Grammatical Acceptability Results

| Setting | Acceptable: Human 1 (out of 300) | Acceptable: Human 2 (out of 300) | Acceptable: Human Consensus (out of 300) | CSCS/Original Binomial $P$ |
|---|---|---|---|---|
| CSCS-proposed ($\beta = 0.25$) | 130 | 158 | 104 | $9.1 \times 10^{-8}$ |
| CSCS-proposed ($\beta = 1$) | 200 | 192 | 174 | 0.25 |
| Original headline | 223 | 233 | 197 | N/A |

**Significant Semantic Change.** For each headline, we considered all possible single-word mutations and picked the top mutation according to the CSCS objective. Proposed mutations resulted in sentences that are qualitatively and quantitatively different from the original (Figure 2). CSCS often proposed word mutations that substantially change the part-of-speech (POS) structure. We quantified this observation by looking at the percentage of words in the mutated headline that had a different POS from the original headline. Using the NLTK POS tagger [10], the CSCS-proposed headline changed the POS of a median of 16.7% of the words; using the FLAIR POS tagger [3], the median change was 14.3% of the words in the headline (Table 1). In contrast, the median POS change for the semantically closest mutated headline (i.e., smallest $\Delta \mathbf{z}[\tilde{x}_i]$) was 0% for both POS taggers (Table 1). Even when POS was not changed, CSCS proposed strikingly different word mutations, which we quantified using semantic similarity scores based on distance in the WordNet hierarchy [39, 28]. Specifically, for noun-to-noun and verb-to-verb changes, we selected the first WordNet synset corresponding to the depluralized or deconjugated version of the word. Across all these changes, the semantically closest mutation had a median pathwise similarity of 0.14 and a median Wu-Palmer similarity [53] of 0.55 (both measures are between 0 and 1, inclusive, where 1 indicates high similarity, i.e., the same synset). In contrast, the CSCS-proposed mutation had a median pathwise similarity of 0.08 and a median Wu-Palmer similarity of 0.24 (Table 1). Mean and standard deviation statistics, with similar trends, are also provided in Table S1. For both POS change and WordNet similarity, the differences between the CSCS-proposed and the semantically closest mutation are highly significant (two-sided independent t-test $P < 10^{-308}$). These results, supported by a qualitative examination of the changes (e.g., Figure 2), show that CSCS-mutated headlines are quite semantically different.
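The two quantities in Table 1 can be computed along the following lines; this is a minimal sketch assuming the NLTK tagger and WordNet corpora have been downloaded, and it is not necessarily identical to the evaluation scripts used for the paper:

```python
import nltk
from nltk.corpus import wordnet as wn
# Requires: nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

def pos_change_percent(original, mutated):
    """Percentage of positions whose NLTK POS tag differs between two
    headlines of equal length (single-word mutations preserve length)."""
    orig_tags = [tag for _, tag in nltk.pos_tag(original.split())]
    mut_tags = [tag for _, tag in nltk.pos_tag(mutated.split())]
    changed = sum(a != b for a, b in zip(orig_tags, mut_tags))
    return 100.0 * changed / len(orig_tags)

def wordnet_similarity(word_a, word_b, pos=wn.NOUN):
    """Path and Wu-Palmer similarity between the first synsets of two words."""
    syn_a = wn.synsets(word_a, pos=pos)[0]
    syn_b = wn.synsets(word_b, pos=pos)[0]
    return syn_a.path_similarity(syn_b), syn_a.wup_similarity(syn_b)

print(pos_change_percent("australian dead in bali", "australian ballet in bali"))
print(wordnet_similarity("season", "flu"))
```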
**Grammaticality Preservation.** We quantified grammaticality by asking human volunteers (12 in total) to provide grammatical acceptability labels. All volunteers were native English speakers with college degrees. Two volunteers were assigned to the same 150-headline text, blinded to the mutational status, and were asked to evaluate only grammaticality and not the content of the phrase, giving a binary yes-or-no label. Out of 300 original headlines, two humans provided a consensus "yes" grammatical label for 197 headlines (Table 2). The 300 corresponding CSCS-mutated headlines had 174 headlines with a consensus "yes" grammaticality label; though lower, the number is within statistical error (two-sided binomial $P$-value of 0.25 compared with the original). When we lowered $\beta$ from 1 to 0.25, thereby reducing the influence of $\hat{p}(\tilde{x}_i \mid \hat{\mathbf{z}}_i)$, consensus grammaticality of the 300 CSCS-mutated headlines dropped significantly to 104 (binomial $P = 9.1 \times 10^{-8}$; Table 2). These results suggest that by considering $\hat{p}(\tilde{x}_i \mid \hat{\mathbf{z}}_i)$, CSCS can preserve grammaticality. In general, CSCS of natural language produces intuitively satisfactory results and may be relevant to work in computational humor [52].

Figure 3: Semantic embedding space of influenza HA visualized in two dimensions via UMAP [37] and colored by host species, subtype, or cluster labels from Louvain clustering [11].

## 4.2 Influenza

### 4.2.1 Language Model Training

**Training Data and Model Selection.** Our training data consists of 44,999 unique influenza A hemagglutinin (HA) amino acid sequences (around 550 residues in length) observed in animal hosts from 1908 through 2019. HA is a highly variable protein on the surface of influenza responsible for binding to host cells [24]. Since immunity to influenza is acquired by developing antibodies that bind and thereby neutralize HA, mutations to HA can lead to loss of immunity by reducing antibody binding affinity (i.e., immunological escape) [30, 33]. Data was obtained from the NIAID Influenza Research Database (IRD) [55] through the web site at http://www.fludb.org (Appendix 6.1.2). These sequences were all obtained from animal hosts and thus, at least implicitly, encode viral viability and infectivity. We evaluated language model performance with a test set of held-out HA sequences whose first recorded date was before 1990 or after 2017, yielding a test set of 7,497 out of 44,999 sequences (about 17%). We again observed that a model with both an LSTM architecture and access to the full sequence context had the best train and test loss (Section 2.2.2).

**Semantically Meaningful Embedding Structure.** To improve our confidence that the embeddings are functionally meaningful, we leverage tools for unsupervised exploration of high-dimensional data. We trained our language model on the full IRD HA corpus, averaged $\hat{\mathbf{z}}_i$ across all residues in each sequence (to enable comparison across variable-length sequences), and visualized the resulting embedding in two dimensions with Uniform Manifold Approximation and Projection (UMAP) [37, 17, 7]. This results in clear structure corresponding to influenza subtype and host species (Figure 3), which we quantify via unsupervised Louvain clustering [11]. Within each cluster, on average, 99.8% of sequences come from a single influenza subtype and 96.2% come from a single host species, indicating high correspondence between semantic structure and biologically important metadata.
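One way to reproduce this kind of exploratory analysis is sketched below, assuming the scanpy package (with a Louvain backend) is installed and that `per_residue_embeddings` is a list of per-sequence arrays of $\hat{\mathbf{z}}_i$ vectors; the authors' exact pipeline may differ:

```python
import numpy as np
import scanpy as sc
from anndata import AnnData

def embed_umap_louvain(per_residue_embeddings, metadata):
    """Average per-residue embeddings into one vector per sequence, then run
    UMAP for 2-D visualization and Louvain clustering on a kNN graph."""
    X = np.stack([z.mean(axis=0) for z in per_residue_embeddings])  # sequence-level embeddings
    adata = AnnData(X, obs=metadata)       # metadata: host species, subtype, ...
    sc.pp.neighbors(adata, use_rep="X")    # k-nearest-neighbor graph in embedding space
    sc.tl.umap(adata)                      # coordinates stored in adata.obsm["X_umap"]
    sc.tl.louvain(adata)                   # cluster labels stored in adata.obs["louvain"]
    return adata
```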
### 4.2.2 Zero-Shot Escape Prediction with CSCS

**H3N2 Causal Escape Dataset.** We validate the ability of CSCS to prioritize escape mutations using an interventional dataset by Lee et al., who made all possible single-residue mutations to HA from the A/Perth/16/2009 (H3N2) strain and assessed which mutants preserve viral infectivity and induce escape [36]. To quantify escape, Lee et al. measured the overrepresentation of infectious viral sequences after immune selection by neutralizing human antibodies. These mutants therefore preserve infectivity and causally induce escape from neutralizing antibodies.

**CSCS Enrichment of Acquired Escapes.** Based on the language model trained over the full IRD HA corpus (Section 4.2.1), we computed $a'(\tilde{x}_i; \mathbf{x})$ for all possible single-residue mutations to the A/Perth/16/2009 HA sequence. We emphasize that none of these mutants were present in the training corpus. The mutants identified by CSCS are substantially enriched for experimentally verified escapes from Lee et al. [36]; e.g., 4 out of the top 5 hits were confirmed escapes (Figure 4). We quantified enrichment by computing the area under the curve (AUC) obtained by plotting acquired escape mutations versus total acquired mutations based on $a'(\tilde{x}_i; \mathbf{x})$, normalized by the maximum area to produce a score between 0 and 1, inclusive, where 0.5 indicates the expected value of random guessing. The AUC obtained by the full CSCS objective is 0.771, compared to 0.709 when acquiring solely based on semantic change and 0.684 when acquiring solely based on grammaticality (Figure 4; Table 3), indicating that both are informative for determining escape. We obtained these results without direct supervision or explicit escape training data.

Figure 4: Left: Escape mutants (red Xs) to A/Perth/16/2009 from Lee et al. [36] have high semantic change and grammaticality. Right: Escape mutants are substantially enriched in top CSCS-acquired mutants; see Table 3.

Table 3: Escape Prediction Results (Normalized AUC)

| Model | Influenza H1 | Influenza H3 | HIV Env |
|---|---|---|---|
| MAFFT MSA | 0.697 | 0.598 | 0.523 |
| EVcouplings (independent) | 0.706 | 0.691 | 0.536 |
| EVcouplings (epistatic) | 0.726 | 0.687 | 0.552 |
| $\Delta \mathbf{z}[\tilde{x}_i]$ alone | 0.664 | 0.709 | 0.622 |
| $p(\tilde{x}_i \mid \mathbf{x})$ alone | 0.820 | 0.684 | 0.667 |
| CSCS ($\Delta \mathbf{z}[\tilde{x}_i]$ and $p(\tilde{x}_i \mid \mathbf{x})$) | 0.834 | 0.771 | 0.692 |

**Benchmark of Existing Approaches.** Though to our knowledge no previous method has been explicitly designed for escape prediction, we compare with standard viral fitness modeling strategies that are closest to our unsupervised problem setting. The first strategy performs MSA of the viral sequence corpus and acquires as escapes simply those mutations with the highest observed mutational frequency [12, 31, 4, 21]; our two benchmark methods that leverage this strategy are MAFFT MSA [29] and EVcouplings independent [26] (see Appendix 6.2.2 for more information). The second strategy also requires MSA followed by parameter estimation in a Potts model [27, 26], which incorporates pairwise residue information; we use the EVcouplings epistatic model that implements this approach, described in greater detail in Appendix 6.2.2. For influenza, we observed consistently higher AUCs obtained by CSCS over all benchmark methods (Table 3), noting that these methods were not specifically designed for viral escape prediction. We also tested pretrained protein sequence embedding models [9, 45, 5], not trained on viral corpuses, to see if their representations automatically transferred to viral escape prediction (Appendix 6.2.3), but this was not the case (Table S2), indicating that specific viral training data greatly improves escape prediction.
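The normalized AUC reported in Table 3 can be computed along the following lines; this is one reasonable discretization of the acquisition curve described above, and the exact normalization in the released code may differ slightly:

```python
import numpy as np

def escape_enrichment_auc(scores, is_escape):
    """Normalized area under the curve of acquired escapes vs. total acquired
    mutations, acquiring from highest to lowest score. A value of 0.5 is the
    expectation under a random ordering; 1.0 means all escapes are ranked first."""
    order = np.argsort(scores)[::-1]                 # acquire best-scoring mutations first
    hits = np.asarray(is_escape, dtype=float)[order]
    cumulative_escapes = np.cumsum(hits)             # escapes found after k acquisitions
    max_area = hits.sum() * len(hits)                # upper bound: every escape found immediately
    return cumulative_escapes.sum() / max_area
```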
**CSCS of H1N1 Viral Mutations.** We evaluated CSCS on HA from another influenza strain, A/WSN/1933, from a different HA subtype (H1 instead of H3), for which causal escape mutations were also determined by the same experimental procedure described above, albeit with a more limited set of neutralizing antibodies [19]. Using the same language model trained on the IRD corpus, we ranked all possible single-residue mutations $\tilde{x}_i$ of A/WSN/1933 HA based on $a'(\tilde{x}_i; \mathbf{x})$. We again found substantial enrichment of escapes (observed in [19]) among the top mutations; the normalized AUC of acquired escape mutations versus total acquired mutations was 0.834 (Table 3). We note that none of these mutated sequences were present in the training data. In contrast, other approaches had lower enrichment of acquired escape mutants (normalized AUC at most 0.726; Tables 3 and S2). Though similar causal escape data is not available for other influenza strains, this additional validation increases our confidence that escape prediction with CSCS generalizes across strains.

## 4.3 HIV

**Setup and Training Data.** To assess generality to other viral proteins, we analyzed the HIV-1 Envelope (Env) protein, which, like influenza HA, is responsible for binding and entering host cells and is also targeted by antibodies [6]. Env is larger than influenza HA (about 850 residues compared to around 550) and more readily escapes immune selection due to viral mutation, even within the same host [47]. We train our language model on 60,857 unique Env sequences from the Los Alamos National Laboratory (LANL) HIV database (Appendix 6.1.3) [23]. We used the same language model architecture as in the influenza HA experiments. We again observed functionally meaningful patterns when visualizing the semantic embeddings of Env sequences (Figure S1).

**Zero-Shot Escape Prediction with CSCS.** We applied CSCS to a dataset quantifying the infectivity and escape potential of all single-residue mutations to Env from the BG505.T332N strain of HIV, generated by an experimental procedure similar to that used for HA from the two influenza strains described above [18]. We ranked all single-residue mutations $\tilde{x}_i$ of BG505.T332N by the CSCS objective $a'(\tilde{x}_i; \mathbf{x})$. We again observed enrichment of escape mutations when acquiring based on both semantic change and grammaticality, though with weaker enrichment than observed for influenza HA (normalized AUC = 0.692; Table 3), suggesting that the semantic complexity of HIV Env might be more difficult to model with existing training data. However, CSCS escape prediction performance still exceeds that of other models (normalized AUC at most 0.574; Tables 3 and S2).

## 5 Discussion

Here we show that a learning-based, distributional approach to modeling viral sequences achieves unprecedented insight into evolution and escape, suggesting a timely and important direction for the machine learning community. Excitingly, we demonstrate that the distributional hypothesis is a productive assumption for analysis of viral variation.
This is not obvious, since it may be possible for non-causal mutations to widely co-occur with causal escape mutations [32], but our results suggest that many of the mutations that alter distributional structure are also causal escape mutations (perhaps due to pressure on viral sequences to maintain both diversity and economy, thereby diminishing the importance of non-causal mutants). The CSCS problem in general is useful for any domain in which substantial functional change is desirable but the feature changes are limited or constrained. For example, in exploring differences in human-versus-machine perception, it may be desirable to generate entities that are perceived as similar by humans but as vastly different by algorithms, or vice versa. Though we focus on zero-shot, unsupervised escape prediction, some supervision could be useful in improving performance (e.g., learning $\beta$ from a handful of examples). A broader problem is in modeling changes other than single-token substitutions, like insertions and deletions, or more complex sequence changes. CSCS that accommodates insertions and deletions (about four times rarer than substitutions in viruses [48]) could likewise model semantic change as a shift in the embedding space and grammaticality as some function of an emitted language model probability. While single-token changes allow for interpretability and efficiency, CSCS could be extended to multi-token changes (e.g., by combining the individual mutational probabilities to approximate the joint probability), though the search problem then becomes combinatorial. It may also be possible to evolve a sequence over multiple timesteps, each with a new single-token change, to produce complex sequence designs.

## Broader Impact

We hope that this work leads to broad positive impact by (1) encouraging those in the machine learning community to contribute to understanding and combatting viruses (and infectious disease more broadly) and by (2) providing state-of-the-art prediction of how viruses can mutate around neutralization, which could be useful as part of rational design of vaccines or therapies. In silico models of how mutation leads to pathogenesis might help reduce both the resources and risks associated with experimentally characterizing viral mutants. A primary goal of infectious disease research in general is to mitigate and prevent pandemic disease events among the global human population, which lead to widespread mortality, suffering, and economic disruption.

In computationally predicting mutations that induce escape or improve viral fitness, misuse could potentially take the form of using such methods to increase the pathogenicity of an existing viral strain. Experimental biologists, policy makers, and ethicists have already devoted, and continue to devote, substantial consideration to the ethics of such gain-of-function research (GOFR) [1, 49, 50]. As computational biologists become part of the GOFR conversation, attention to ethics is paramount, and the scientific community should continue to preserve and strengthen the existing combination of experimental and policy safeguards. Work in this area should continue to rely on direct experimental validation of computational prediction so that any system failures can be identified and corrected.
Global viral surveillance already benefits from international cooperation through entities like the World Health Organization and collaborations like the Global Virome Project [13], and both the IRD and LANL HIV databases already have substantial global coverage across six continents [55, 23]. Preventing dataset bias toward certain geographies or human populations underscores the already high priority given to viral monitoring at a global scale.

## Acknowledgments and Disclosure of Funding

We thank Alejandro Balazs, Owen Leddy, Adam Lerer, Allen Lin, Adam Nitido, Uma Roy, and Aaron Schmidt for helpful discussions. We thank Steven Chun, Benjamin DeMeo, Ashwin Narayan, An Nguyen, Sarah Nyquist, and Alexander Wu for assistance with the manuscript. B.H. and E.Z. are partially funded by NIH grant R01 GM081871 (to B.A.B.). B.H. is partially funded by the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG). E.Z. is partially funded by the National Science Foundation (NSF) Graduate Research Fellowship Program (GRFP). B.D.B. acknowledges funding from the Ragon Institute of MGH, MIT, and Harvard; MIT Biological Engineering; and NIH grant R01 A1022553.

## References

[1] Doing Diligence to Assess the Risks and Benefits of Life Sciences Gain-of-Function Research. The White House, President Barack Obama, 2014.

[2] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In International Conference on Database Theory, volume 1973, pages 420–434, 2001.

[3] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 54–59, 2019.

[4] T. M. Allen, M. Altfeld, S. C. Geer, E. T. Kalife, C. Moore, K. M. O'Sullivan, I. De Souza, M. E. Feeney, R. L. Eldridge, E. L. Maier, D. E. Kaufmann, M. P. Lahaie, L. Reyor, G. Tanzi, M. N. Johnston, C. Brander, R. Draenert, J. K. Rockstroh, H. Jessen, E. S. Rosenberg, S. A. Mallal, and B. D. Walker. Selective Escape from CD8+ T-Cell Responses Represents a Major Driving Force of Human Immunodeficiency Virus Type 1 (HIV-1) Sequence Diversity and Reveals Constraints on HIV-1 Evolution. Journal of Virology, 79(21):13239–13249, 2005.

[5] Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12):1315–1322, 2019.

[6] Kathryn Twigg Arrildt, Sarah Beth Joseph, and Ronald Swanstrom. The HIV-1 Env protein: A coat of many colors. Current HIV/AIDS Reports, 9(1):52–63, 2012.

[7] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel W. H. Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38–44, 2019.

[8] Trevor Bedford, Steven Riley, Ian G. Barr, Shobha Broor, Mandeep Chadha, Nancy J. Cox, Rodney S. Daniels, C. Palani Gunasekaran, Aeron C. Hurt, Anne Kelso, Alexander Klimov, Nicola S. Lewis, Xiyan Li, John W. McCauley, Takato Odagiri, Varsha Potdar, Andrew Rambaut, Yuelong Shu, Eugene Skepner, Derek J. Smith, Marc A. Suchard, Masato Tashiro, Dayan Wang, Xiyan Xu, Philippe Lemey, and Colin A. Russell. Global circulation patterns of seasonal influenza viruses vary with antigenic drift. Nature, 523(7559):217–220, 2015.
[9] Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. 7th International Conference on Learning Representations, cs.LG:1902.08661, 2019.

[10] Steven Bird, Edward Loper, and Ewan Klein. Natural Language Processing with Python. O'Reilly Media Inc., 2009.

[11] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[12] John A. Capra and Mona Singh. Predicting functionally important residues from sequence conservation. Bioinformatics, 23(15):1875–1882, 2007.

[13] Dennis Carroll, Peter Daszak, Nathan D. Wolfe, George F. Gao, Carlos M. Morel, Subhash Morzaria, Ariel Pablos-Méndez, Oyewale Tomori, and Jonna A. K. Mazet. The Global Virome Project. Science, 359(6378):872–874, 2018.

[14] José M. Cuevas, Pilar Domingo-Calap, Marianoel Pereira-Gómez, and Rafael Sanjuán. Experimental Evolution and Population Genetics of RNA Viruses. The Open Evolution Journal, 3(1):9–16, 2009.

[15] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. Advances in Neural Information Processing Systems, pages 3079–3087, 2015.

[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, cs.CL(1810.04805), 2018.

[17] Alex Diaz-Papkovich, Luke Anderson-Trocmé, Chief Ben-Eghan, and Simon Gravel. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genetics, 15(11):e1008432, 2019.

[18] Adam S. Dingens, Dana Arenz, Haidyn Weight, Julie Overbaugh, and Jesse D. Bloom. An Antigenic Atlas of HIV-1 Escape from Broadly Neutralizing Antibodies Distinguishes Functional and Structural Epitopes. Immunity, 50(2):520–532.e3, 2019.

[19] Michael B. Doud, Juhye M. Lee, and Jesse D. Bloom. How single mutations affect viral escape from broad and narrow antibodies to H1 influenza hemagglutinin. Nature Communications, 9(1):1386, 2018.

[20] J. W. Drake. A constant rate of spontaneous mutation in DNA-based microbes. Proceedings of the National Academy of Sciences of the United States of America, 88(16):7160–7164, 1991.

[21] Guido Ferrari, Bette Korber, Nilu Goonetilleke, Michael K. P. Liu, Emma L. Turnbull, Jesus F. Salazar-Gonzalez, Natalie Hawkins, Steve Self, Sydeaka Watson, Michael R. Betts, Cynthia Gay, Kara McGhee, Pierre Pellegrino, Ian Williams, Georgia D. Tomaras, Barton F. Haynes, Clive M. Gray, Persephone Borrow, Mario Roederer, Andrew J. McMichael, and Kent J. Weinhold. Relationship between functional profile of HIV-1 specific CD8 T cells and epitope variability with the selection of escape mutants in acute HIV-1 infection. PLoS Pathogens, 7(2):e1001273, 2011.

[22] John Rupert Firth. A Synopsis of Linguistic Theory, 1930–1955. 1957.

[23] Brian Foley, Cristian Apetrei, Ilene Mizrachi, Andrew Rambaut, Bette Korber, Thomas Leitner, Beatrice Hahn, James Mullins, and Steven Wolinsky. HIV Sequence Compendium 2018. HIV Sequence Compendium, LA-UR 18-2, 2018.

[24] Steven J. Gamblin and John J. Skehel. Influenza hemagglutinin and neuraminidase membrane glycoproteins. Journal of Biological Chemistry, 285(37):28403–28409, 2010.

[25] Zellig S. Harris. Distributional Structure. WORD, 10(2-3):146–162, 1954.

[26] Thomas A. Hopf, Anna G. Green, Benjamin Schubert, Sophia Mersmann, Charlotta P. I. Schärfe, John B. Ingraham, Agnes Toth-Petroczy, Kelly Brock, Adam J. Riesselman, Perry Palmedo, Chan Kang, Robert Sheridan, Eli J. Draizen, Christian Dallago, Chris Sander, and Debora S. Marks. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics, 35(9):1582–1584, 2019.
[27] Thomas A. Hopf, John B. Ingraham, Frank J. Poelwijk, Charlotta P. I. Schärfe, Michael Springer, Chris Sander, and Debora S. Marks. Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2):128–135, 2017.

[28] Daniel Jurafsky and James Martin. Speech and Language Processing. Pearson Education, Inc., 2014.

[29] Kazutaka Katoh and Daron M. Standley. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, 2013.

[30] Hyunsuh Kim, Robert G. Webster, and Richard J. Webby. Influenza Virus: Dealing with a Drifting and Shifting Pathogen. Viral Immunology, 31(2):174–183, 2018.

[31] Björn F. Koel, David F. Burke, Theo M. Bestebroer, Stefan Van Der Vliet, Gerben C. M. Zondag, Gaby Vervaet, Eugene Skepner, Nicola S. Lewis, Monique I. J. Spronken, Colin A. Russell, Mikhail Y. Eropkin, Aeron C. Hurt, Ian G. Barr, Jan C. De Jong, Guus F. Rimmelzwaan, Albert D. M. E. Osterhaus, Ron A. M. Fouchier, and Derek J. Smith. Substitutions near the receptor binding site determine major antigenic change during influenza virus evolution. Science, 342(6161):976–979, 2013.

[32] Katia Koelle, Sarah Cobey, Bryan Grenfell, and Mercedes Pascual. Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science, 314(5807):1898–1903, 2006.

[33] Adam J. Kucharski, Justin Lessler, Jonathan M. Read, Huachen Zhu, Chao Qiang Jiang, Yi Guan, Derek A. T. Cummings, and Steven Riley. Estimating the Life Course of Influenza A(H3N2) Antibody Responses from Cross-Sectional Data. PLoS Biology, 13(3):e1002082, 2015.

[34] Rohit Kulkarni. A Million News Headlines. kaggle, therohk/mi, 2020.

[35] Thomas K. Landauer and Susan T. Dumais. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211, 1997.

[36] Juhye M. Lee, Rachel Eguia, Seth J. Zost, Saket Choudhary, Patrick C. Wilson, Trevor Bedford, Terry Stevens-Ayers, Michael Boeckh, Aeron C. Hurt, Seema S. Lakdawala, Scott E. Hensley, and Jesse D. Bloom. Mapping person-to-person variation in viral mutations that escape polyclonal serum targeting influenza hemagglutinin. eLife, 27(8):e49324, 2019.

[37] Leland McInnes and John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, stat.ML(1802.03426), 2018.

[38] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[39] George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.

[40] Jonas Mueller, David N. Reshef, George Du, and Tommi Jaakkola. Learning optimal interventions. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1039–1047, 2017.

[41] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. Conference on Empirical Methods in Natural Language Processing (EMNLP), Proceedings of the Conference, 2014.

[42] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 78(5):947–1012, 2016.
[43] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. Proceedings of NAACL-HLT, pages 2227–2237, 2018.

[44] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[45] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating Protein Transfer Learning with TAPE. Advances in Neural Information Processing Systems, pages 9686–9698, 2019.

[46] Philipp Rentzsch, Daniela Witten, Gregory M. Cooper, Jay Shendure, and Martin Kircher. CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research, 47(D1):D886–D894, 2019.

[47] Douglas D. Richman, Terri Wrin, Susan J. Little, and Christos J. Petropoulos. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proceedings of the National Academy of Sciences of the United States of America, 100(7):4144–4149, 2003.

[48] Rafael Sanjuán, Miguel R. Nebot, Nicola Chirico, Louis M. Mansky, and Robert Belshaw. Viral Mutation Rates. Journal of Virology, 84(19):9733–9748, 2010.

[49] S. Schultz-Cherry, R. J. Webby, R. G. Webster, A. Kelso, I. G. Barr, J. W. McCauley, R. S. Daniels, D. Wang, Y. Shu, E. Nobusawa, S. Itamura, M. Tashiro, Y. Harada, S. Watanabe, T. Odagiri, Z. Ye, G. Grohmann, R. Harvey, O. Engelhardt, D. Smith, K. Hamilton, F. Claes, and G. Dauphin. Influenza gain-of-function experiments: Their role in vaccine virus recommendation and pandemic preparedness. mBio, 5(6):e02430-14, 2014.

[50] Michael J. Selgelid. Gain-of-Function Research: Ethical Analysis. Science and Engineering Ethics, 22(4):923–964, 2016.

[51] Laksshman Sundaram, Hong Gao, Samskruthi Reddy Padigepati, Jeremy F. McRae, Yanjun Li, Jack A. Kosmicki, Nondas Fritzilas, Jörg Hakenberg, Anindita Dutta, John Shon, Jinbo Xu, Serafim Batzoglou, Xiaolin Li, and Kyle Kai-How Farh. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8):1161–1170, 2018.

[52] I. M. Suslov. How to realize "a sense of humour" in computers? arXiv, cs.CL(0711.3197), 2007.

[53] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.

[54] Rui Yin, Emil Luusua, Jan Dabrowski, Yu Zhang, and Chee Keong Kwoh. Tempel: time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks. Bioinformatics, page btaa050, 2020.

[55] Yun Zhang, Brian D. Aevermann, Tavis K. Anderson, David F. Burke, Gwenaelle Dauphin, Zhiping Gu, Sherry He, Sanjeev Kumar, Christopher N. Larsen, Alexandra J. Lee, Xiaomei Li, Catherine Macken, Colin Mahaffey, Brett E. Pickett, Brian Reardon, Thomas Smith, Lucy Stewart, Christian Suloway, Guangyu Sun, Lei Tong, Amy L. Vincent, Bryan Walters, Sam Zaremba, Hongtao Zhao, Liwei Zhou, Christian Zmasek, Edward B. Klem, and Richard H. Scheuermann. Influenza Research Database: An integrated bioinformatics resource for influenza virus research. Nucleic Acids Research, 45(D1):D466–D474, 2017.