# Deep Extrapolation for Attribute-Enhanced Generation

Alvin Chan* (Salesforce Research, NTU), Ali Madani* (Salesforce Research), Ben Krause (Salesforce Research), Nikhil Naik (Salesforce Research)

*Equal contribution. Correspondence to amadani@salesforce.com. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation, focusing on natural language and proteins, and propose GENhance, a generative framework that enhances attributes through a learned latent space. Trained on movie reviews and a computed protein stability dataset, GENhance can generate strongly-positive text reviews and highly stable protein sequences without being exposed to similar data during training. We release our benchmark tasks and models to contribute to the study of generative modeling extrapolation and data-driven design in biology and chemistry: https://github.com/salesforce/genhance.

1 Introduction

Deep generative neural networks can generate realistic data across data types, from sequences to images to time-series data, with applications in domains such as natural language processing (NLP), computer vision, and speech. Beyond these canonical domains, the scientific problem of synthetically designing proteins, molecules, and materials can be cast as generative modeling of sequences, graphs, or images (Anand & Huang, 2018; De Cao & Kipf, 2018; Madani et al., 2020, 2021). Most often, the goal is to design or generate a sample that improves upon the attribute label of interest (Fig. 1, left), which we term attribute-enhanced generation. Examples include generating a protein sequence with higher binding affinity, or a nanomaterial structure with a more energetically favorable state, than all of the samples in the training distribution. In these scientific fields, traditional methods for designing synthetic objects with improved attributes are iterative and expensive, relying on labor- or compute-intensive methods (Bepler & Berger, 2021; Wu et al., 2021; Hie & Yang, 2021). Hence, deep generative models that can design new proteins, molecules, and materials with improved attributes have the potential to dramatically accelerate design research. Beyond scientific applications, extrapolation in generation has potential applications in NLP, such as reducing toxicity or operating in low-resource settings. It is, however, a well-known challenge for deep neural networks to generate samples beyond the training distribution (Arora et al., 2017; Radford et al., 2019; Xu et al., 2020).

In this work, we develop a method for extrapolation, particularly for sequences. Our approach, called GENhance, generates an enhanced sequence using a learned latent space. GENhance consists of a generator (sampler) and a discriminator (ranker) that are jointly trained to minimize generation and discrimination losses, regularized by latent-vector smoothing and a cycle-consistency loss. We evaluate GENhance in two data domains. First, we use the Stanford Sentiment Treebank (SST), a natural language benchmark containing movie reviews with five discrete sentiment attributes (Socher et al., 2013), to show that GENhance generates strongly positive reviews after training with no
positive examples. Second, we develop a protein stability dataset for the ACE2 protein (Chan et al., 2020) with a continuous change-in-free-energy (ddG) attribute, and show that GENhance can generate protein sequences with higher stability than the training set (Fig. 1, center). GENhance significantly outperforms baseline methods based on (i) a generator-discriminator model with rejection sampling and (ii) an algorithm using Metropolis-Hastings Markov chain Monte Carlo sampling with a trained discriminator. GENhance's performance is further improved when it is given access to a few examples with attribute scores beyond the training distribution.

Figure 1: Attribute-enhanced generation. The goal of extrapolation (left) is to generate samples whose attribute values exceed those of all training samples, y*. We explore target attribute extrapolation for protein sequences (center) and movie reviews (right), where more stable protein sequences and more positive text reviews are generated.

Our contributions are summarized below:
- We formalize the task of extrapolation for deep generative models, focused on enhancing attributes in sequence generation, with important scientific applications in synthetic object design.
- We introduce GENhance, a regularized encoder-decoder framework with a learned latent space, and demonstrate its superior performance with respect to rigorous baseline techniques.
- We curate extrapolation benchmarks in NLP and proteins. We release the data and evaluation metrics, along with oracle models and scripts for automatic evaluation of generation quality.

2 Related Work

Generalization to Low Data Regimes: Previous approaches aim to generalize classification and regression to low-data settings. Imbalanced classification methods upsample or downsample classes (Chawla et al., 2002; García & Herrera, 2009) or reweight the training cost function (Huang et al., 2016; Cao et al., 2019; Cui et al., 2019). Yang et al. (2021) improve the generalization of regression models in extrapolation and interpolation of a continuous data domain by smoothing both the labels and features of the training data. Unlike prior work in this area, GENhance aims to generate samples in low/no-data settings. Methods that better generalize discriminators to these regions are complementary and orthogonal to our work.

Data-Driven Design: Data-driven design aims to learn a distribution over a high-dimensional input space that is optimized for a fitness function corresponding to a desirable property. Design methods often iterate between sampling from a generator and updating the generator to assign higher probability to inputs that a discriminator predicts to have higher fitness (Bedbrook et al., 2019; Biswas et al., 2021; Mansouri Tehrani et al., 2018). Auto-focused oracles (Fannjiang & Listgarten, 2020) also adapt discriminators throughout this optimization process, re-weighting the training examples in the cost function to make them more reliable in the regions where the generator is more likely to generate.
CbAS (Brookes et al., 2019) and DbAS (Brookes & Listgarten, 2018) use a fixed discriminator (oracle) model and iteratively learn the distribution of inputs conditioned on a desirable property using importance sampling. CbAS is an improved version of DbAS that also re-weights samples based on how close they are to the original training data. We view these techniques as complementary, since GENhance proposes a model-specific architecture for optimizing attributes. Das et al. (2021) train a VAE to learn a latent space and use latent-space classifiers to sample latent vectors through rejection sampling, decoding them into sequences expected to have the target attribute/label. Hawkins-Hooker et al. (2021) also decode generations from a VAE by conditioning on latent vectors that correspond to the target attribute/label. Hoffman et al. (2020) optimize molecular designs using zeroth-order optimization on query-based predictions of candidate molecules' properties. Gómez-Bombarelli et al. (2018) build a Gaussian process (GP) regression model trained on latent vectors to predict their inputs' labels and use gradient-based optimization over the GP to find sequences with target attributes. Compared with these previous works, the core difference in our approach is the combination of a cycle-consistency objective and a contrastive discriminatory objective to train the generator and discriminator as one model.

Controllable Text Generation: Our work is also related to controllable text generation, which aims to generate text that corresponds to a user-specified attribute (Kikuchi et al., 2016; Ficler & Goldberg, 2017). CTRL (Keskar et al., 2019) generates controlled, fluent text through the use of control codes, meta-data prepended to the text during generation. Krause et al. (2020) use a generative discriminator, built by contrasting predictions from opposing control codes, to guide generation. CoCon (Chan et al., 2021) performs zero-shot controllable text generation without attribute labels. Ziegler et al. (2019) optimize language generation for desirable attributes via human-in-the-loop reinforcement learning. Similarly to GENhance, PPLM (Dathathri et al., 2020) applies a discriminator on top of the latent space of a generative model to guide generation; however, GENhance uses an autoencoder rather than a language model. Lastly, text style transfer methods have used autoencoders with disentangled style latent representations (Shen et al., 2017; Hu et al., 2017; Yang et al., 2018). Unlike text style transfer and previous approaches to controllable text generation, GENhance differs, aside from its model formulation, in that its goal is to optimize and extrapolate a particular attribute beyond the training distribution.

Our goal is to generate sequences with target attribute values that are better than those of the training data. Formally, assume there is a ground-truth oracle $O$ that maps each sample $x \in \mathbb{R}^d$ to its target attribute value $y \in \mathbb{R}$, i.e., $y = O(x)$. Given a dataset $\mathcal{D}$ of oracle-labeled samples, we aim to generate new sequences whose ground-truth attribute values are better than those of this dataset:

$y_{\text{new}} > y^*, \quad \forall (x, y) \in \mathcal{D}: y \le y^* \quad (1)$

To generate samples that satisfy this criterion with high probability, we develop a sampling-ranking framework that consists of a sampler S, which proposes a pool of candidate sequences, and a ranker R, which infers the relative scores of these candidates.
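The sampling-ranking framework can be summarized in a few lines. Below is a minimal sketch, with `propose_candidates` (the sampler S) and `score` (the ranker R) as hypothetical stand-ins for the concrete models described in the following sections.

```python
from typing import Callable, List


def sample_and_rank(
    propose_candidates: Callable[[int], List[str]],  # sampler S: returns n candidate sequences
    score: Callable[[str], float],                    # ranker R: higher score = better predicted attribute
    n_candidates: int = 25_000,
    top_k: int = 100,
) -> List[str]:
    """Propose a pool of candidates with S, then keep the top-k under R's score."""
    pool = propose_candidates(n_candidates)
    ranked = sorted(pool, key=score, reverse=True)
    return ranked[:top_k]
```

Each method below instantiates S and R differently: Gen-Disc uses a language-model generator with a separate discriminator, MCMC uses iterative mutation scored by the same discriminator, and GENhance uses a single encoder-decoder whose encoder doubles as the ranker.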
First, we describe two baseline generation techniques that are natural choices for this task, and then build on them to develop our GENhance model.

3.1 Generator-Discriminator Rejection Sampling

The first baseline, Gen-Disc, is a rejection sampling approach that uses a generator model as the sampler S and a separate discriminator model as the ranker R. The generator is trained to model the training distribution p(x) through a language modeling objective, where it learns to construct the training samples auto-regressively (Manning et al., 1999; Bengio et al., 2003):

$p(x_t, \ldots, x_l \mid x_1, \ldots, x_{t-1}) = \prod_{i=t}^{l} p(x_i \mid x_1, \ldots, x_{i-1}), \quad x = \{x_1, \ldots, x_l\} \quad (2)$

where $l$ is the length of the training sequence. The discriminator model is trained to predict the relative ranking of a pair of sequences based on their attribute values. Given two training samples $(x_a, y_a)$ and $(x_b, y_b)$, the pairwise contrastive loss is:

$\mathcal{L}_{\text{contrast}} = -\log\frac{1}{1+\exp(-(\hat{y}_a - \hat{y}_b))}, \quad \hat{y}_a = f_{\text{disc}}(x_a), \quad y_a > y_b \quad (3)$

where $f_{\text{disc}}$ denotes the discriminator, which outputs a scalar score for each input sequence. We employ this contrastive loss since it can be applied to both continuous- and discrete-labeled samples. After training, we sample candidate sequences from the generator auto-regressively and use the discriminator to rank the sequences according to its output scores.

Figure 2: GENhance is an encoder-decoder framework with a latent space between the two. GENhance is trained to extrapolate beyond the training distribution of attributes by learning the latent space with a combination of contrastive, smoothing, and cycle-consistency losses, in addition to the reconstruction loss for autoregressive generation.

3.2 Metropolis-Hastings Markov Chain Monte Carlo

Traditional methods for data-driven design rely on iterative optimization of candidates with better attributes. To mimic this process, we design a method that generates candidates using Metropolis-Hastings MCMC sampling from a population of better candidates. We start with an initial population of sequences sampled from the training set. In the sampling step, new candidates are proposed by making edits to samples from this population, scored with the ranker R, and compared with the previous population of candidates. The probability that new generations are kept in the population depends on the score predicted by R. The cycle of sampling and ranking repeats until terminated. The ranker R takes the form of a neural network, identical to the discriminator model in the Gen-Disc setup.
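Both baselines (and, as described next, GENhance's encoder in Eq. 4) rely on the pairwise contrastive ranking loss of Eq. 3. A minimal PyTorch sketch is shown below; `score_a` and `score_b` are hypothetical names for the scalar scores of the higher- and lower-attribute sample of each pair, produced by whichever discriminator module is being trained.

```python
import torch
import torch.nn.functional as F


def pairwise_contrastive_loss(score_a: torch.Tensor, score_b: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(score_a - score_b), averaged over a batch of pairs.

    score_a, score_b: tensors of shape (batch,), where pairs are ordered so that
    the sample behind score_a has the higher ground-truth attribute value.
    """
    # softplus(score_b - score_a) is a numerically stable form of
    # -log(1 / (1 + exp(-(score_a - score_b))))
    return F.softplus(score_b - score_a).mean()
```

Minimizing this loss pushes the discriminator to assign a higher scalar score to the sample with the better attribute, without requiring calibrated absolute labels.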
3.3 GENhance

In the Gen-Disc framework, the generator is trained only to model the training distribution, so there is no direct way to steer the generation distribution in a particular direction beyond the training data. In the MCMC framework, it may be challenging to find desirable candidates with stochastic mutation operations, since the search space can be large and high-dimensional for many design problems (Kumar & Levine, 2019). To overcome these limitations, we propose using a learned latent space (Kingma & Welling, 2013) to control the attributes of generated sequences.

Architecture: GENhance is an encoder-decoder framework with a latent space between its encoder (ENC) and decoder (DEC) modules (Figure 2). The latent vector $z \in \mathbb{R}^{d_z}$ of an input sequence $x \in \mathbb{R}^{d_x}$ is the output of the encoder, i.e., $z = \mathrm{ENC}(x)$. In our experiments, z is the hidden state at the position of a <cls> token (Devlin et al., 2018) prepended to the input sequence. Within the latent vector z, the representations relevant and irrelevant to the attribute of interest are stored in z∥ and z⊥ respectively, i.e., z = [z∥; z⊥]. To train the encoder to store information about the target attribute in z∥, we use a contrastive objective that learns which of two samples has the better attribute value:

$\mathcal{L}_{\text{contrast}} = -\log\frac{1}{1+\exp(-(\hat{y}_a - \hat{y}_b))}, \quad \hat{y}_a = f_{\parallel}(z^a_{\parallel}), \quad [z^a_{\parallel}; z^a_{\perp}] = \mathrm{ENC}(x_a), \quad y_a > y_b \quad (4)$

where $(x_a, y_a)$ and $(x_b, y_b)$ are a pair of training samples, each consisting of an input sequence x and its label y, and $f_{\parallel}$ is an operation that maps z∥ to a scalar value. Here we use a z∥ of dimension 1, and $f_{\parallel}$ is simply the identity operation.

We train GENhance to generate sequences with an objective where the decoder autoregressively reconstructs the input sequence while conditioned on the latent vector z. For an input sequence x of length l, parameterizing ENC with $\phi$ and DEC with $\theta$, the reconstruction loss is:

$\mathcal{L}_{\text{recon}} = -\sum_{i=1}^{l} \log p_{\theta}\big(x_i \mid z, \{x_1, \ldots, x_{i-1}\}\big) = -\sum_{i=1}^{l} \log p_{\theta,\phi}(x_i \mid x) \quad (5)$

To ensure that a perturbed latent vector z still yields plausible generations, we include a smoothing objective, the deterministic Wasserstein autoencoder maximum mean discrepancy (WAE-MMD) objective (Tolstikhin et al., 2017), to train the latent space, as it has been shown to be effective for discrete sequences. The WAE-MMD term (defined in Supplement A.1) penalizes divergence of the latent vectors z from a target prior distribution $P_z$, a unit Gaussian in our case:

$\mathcal{L}_{\text{smooth}} = \mathrm{MMD}(P_z, z) \quad (6)$

To help learn a better latent space and a stronger discriminator within GENhance, we propose a cycle-consistency learning objective ($\mathcal{L}_{\text{cyc-con}}$) that trains ENC to correctly predict the relative rank between two reconstructed inputs:

$\mathcal{L}_{\text{cyc-con}} = -\log\frac{1}{1+\exp(-(\hat{y}_a - \hat{y}_b))}, \quad \hat{y}_a = f_{\parallel}(\hat{z}^a_{\parallel}), \quad [\hat{z}^a_{\parallel}; \hat{z}^a_{\perp}] = \mathrm{ENC}(\bar{x}_a), \quad \bar{x}_a = \mathrm{DEC}(\mathrm{ENC}(x_a)), \quad y_a > y_b \quad (7)$

The intuition behind this objective is two-fold. First, since the discriminator (ENC) is used to rank generated sequences during inference, we can improve its performance on such synthetic sequences by also training it on generated sequences ($\bar{x}$) during the training phase. Second, by backpropagating the $\mathcal{L}_{\text{cyc-con}}$ term through GENhance, the model can learn a latent space whose generated sequences are easy for the discriminator to rank accurately. Combining all the training objectives, we optimize with stochastic gradient descent to approximate the optimal parameters of GENhance:

$\theta^*, \phi^* = \arg\min_{\theta,\phi} \big(\lambda_{\text{contrast}}\mathcal{L}_{\text{contrast}} + \lambda_{\text{recon}}\mathcal{L}_{\text{recon}} + \lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}} + \lambda_{\text{cyc-con}}\mathcal{L}_{\text{cyc-con}}\big) \quad (8)$

To the best of our knowledge, this is the first instance of using cycle-consistency with a contrastive loss to train a generative model.

Sampling & Ranking: After training, we can sample candidates from GENhance's latent space and rank the generated samples with the scalar scores output by GENhance's ENC. First, we encode a training sample with ENC to obtain a latent vector z. To obtain the latent encoding of a new candidate sequence with an improved attribute, we apply a perturbation Δz∥ to the attribute-aligned latent component z∥. Finally, GENhance's DEC conditions on the perturbed latent vector z' to generate the improved candidate x':

$x' = \mathrm{DEC}(z'), \quad z' = [\,z_{\parallel} + \Delta z_{\parallel};\; z_{\perp}\,], \quad [z_{\parallel}; z_{\perp}] = \mathrm{ENC}(x) \quad (9)$

The perturbation Δz∥ is chosen as the direction that increases $f_{\parallel}$'s score output, i.e., ∂f∥/∂z∥.
For a linear layer $f_{\parallel}$, this direction is given by the layer's weights; in our case, where $f_{\parallel}$ is an identity operator, Δz∥ is simply a scalar. After generating a pool of candidates with GENhance, we rank them and retain the top-scoring candidates according to the score predicted by GENhance's ENC:

$\hat{y} = f_{\parallel}(\mathrm{ENC}(x')) \quad (10)$

4 Experiments and Results

4.1 Experiments in Natural Language with SST-5

Dataset: The Stanford Sentiment Treebank-5 (SST-5) (Socher et al., 2013) contains movie reviews from Rotten Tomatoes labeled with one of five ordinally increasing sentiment labels: Strong-Negative, Negative, Neutral, Positive, and Strong-Positive. This allows us to study enhancement approaches for discrete ground-truth labels. We curate two data splits. The first, SST-5 200-Pos, removes all Strong-Positive examples from the training set while keeping 200 randomly sampled Weak-Positive examples, to simulate the presence of a small number of higher-attribute samples. For the more challenging SST-5 No-Pos setup, both Weak-Positive and Strong-Positive samples are removed. The 200-Pos and No-Pos training sets have 5134 and 4934 samples, respectively.

Table 1: GENhance generates a large fraction of attribute-enhanced sequences for SST-5 when 200 Weak-Positive samples are present in the training set (< 4% of the training set). Metrics are computed for the top-1000 ranked sequences. SST-5 test samples have a mean perplexity of 101.3. Smoothing = latent smoothing, CC = cycle-consistency.

| Model | % Positive (↑ better) | % Strong-Positive (↑ better) | Perplexity (↓ better) | E[%SP] (↑ better) |
|---|---|---|---|---|
| Gen-Disc (baseline) | 90.6 | 26.7 | 63.9 | 14.2 |
| MCMC-Random (baseline) | 17.6 | 0.5 | 49696 | 0.17 |
| MCMC-T5 (baseline) | 54.8 | 10.8 | 224 | 4.58 |
| GENhance w/o Smoothing & CC | 88.2 | 21.5 | 125 | 15.44 |
| GENhance w/o CC | 91.3 | 23.6 | 101 | 16.62 |
| GENhance | 98.7 | 49.7 | 90.5 | 44.33 |

Training: For both the Gen-Disc and MCMC models, we train the discriminator by finetuning a publicly available (Wolf et al., 2019) pretrained T5-base encoder (Raffel et al., 2019). The generator modules of both Gen-Disc and GENhance are trained by finetuning the whole pretrained T5-base encoder-decoder model. The Gen-Disc generator is trained with a language modeling objective by feeding an empty string as the encoder's input and minimizing the cross-entropy loss between the decoder's output tokens and the training sample's tokens through teacher forcing. For GENhance, each training sample is fed in as the T5 encoder's input and also used as the label for the decoder's output for the reconstruction objective. Further details on training settings on four NVIDIA A100 GPUs are given in Supplement A.1 to A.3.

Evaluation: We generate 25,000 candidate sequences from each model and use the respective discriminator modules to rank the sequences into pools of the top-100, top-1000, and top-10000 sequences. The percentage of candidates containing target attributes (Strong-Positive and Weak-Positive) is computed using a ground-truth oracle model. In our experiments, we use a pretrained BERT-large (Devlin et al., 2018) model finetuned with a classification objective on the full SST-5 training set (including Strong-Positive and Weak-Positive samples). This oracle model is trained with a batch size of 32 for 30 epochs and achieves an accuracy of 92.5% for strong-positive vs. neutral/negative classification. Neutral-labeled SST-5 sequences are used as the initial sequences for the MCMC baselines and as the input sequences for the GENhance models.
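To make the GENhance sampling-and-ranking procedure of Eqs. 9-10 concrete, here is a minimal inference-time sketch. The `enc` and `dec` callables are hypothetical stand-ins: `enc` maps a token sequence to its latent vector, whose first dimension is taken to be the 1-dimensional attribute-aligned z∥ (with f∥ the identity), and `dec` autoregressively decodes a sequence conditioned on a latent vector.

```python
import torch


def enhance(x, enc, dec, delta: float):
    """Decode a candidate from a latent vector whose z_parallel component is shifted by delta."""
    with torch.no_grad():
        z = enc(x)                      # z = [z_parallel; z_perp]
        offset = torch.zeros_like(z)
        offset[..., 0] = delta          # perturb only the attribute-aligned scalar z_parallel
        return dec(z + offset)          # candidate x' with (hopefully) enhanced attribute


def rank_candidates(candidates, enc, top_k: int = 100):
    """Rank generated candidates by the encoder's scalar score f_parallel(z_parallel)."""
    with torch.no_grad():
        scores = torch.tensor([float(enc(x)[..., 0]) for x in candidates])
    order = torch.argsort(scores, descending=True)[:top_k]
    return [candidates[i] for i in order.tolist()]
```

In practice, `delta` is set relative to the spread of z∥ over the training data, as described next.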
Δz∥ perturbations with magnitude equal to 5% of the standard deviation of the training samples' z∥ are used for all GENhance generations. We develop an additional performance metric, E[%SP], the expected percentage of Strong-Positive generated sequences. The metric was designed to (i) provide a statistically relevant measure through an expectation value and (ii) use the Strong-Positive label alone, to maximize oracle label fidelity, since Strong-Positive labels are almost perfectly distinguishable from the training labels of Neutral and lower. It is computed with the following steps: (a) randomly sample 1000 of the 25,000 generations, (b) filter out the top-100 candidates based on the discriminator's ranking, (c) compute the % Strong-Positive in this top-100 with the ground-truth oracle model, and (d) repeat steps (a) to (c) for 100 rounds and average the % Strong-Positive values.

As a proxy for text quality, we compute the perplexity of each generation using a pretrained GPT-2 large model (Radford et al., 2019) and average the values across the top-K pools. To guide the MCMC substitution step (MCMC-T5), we use a pretrained T5-base model, since random token substitution would degrade fluency. During mutation, a span of 1 or 2 tokens is masked and the masked sequence is fed into the T5 model to generate a replacement.

Finally, to evaluate generated text beyond the oracle model's scores, we conducted a human evaluation study of the positiveness and fluency of our text generations. The study was formulated as an A/B test in which three evaluators compared pairs of texts in a blinded, random order to separately determine which text was more positive and which was more fluent. The comparisons were between text generated by GENhance and (1) the Gen-Disc baseline, (2) the MCMC baseline, (3) SST-5 training data labeled as neutral, and (4) SST-5 training data labeled as positive. The evaluators compared 100 samples for each of the four comparisons in both the 200-Pos and No-Pos settings, totaling 800 A/B comparisons. For each comparison, the majority answer among the three evaluators was taken as the final score.

Table 2: GENhance generates a large fraction of attribute-enhanced sequences for SST-5 when no positive samples are present in the training set. Metrics are computed for the top-1000 ranked sequences. SST-5 test samples have a mean perplexity of 101.3. Smoothing = latent smoothing, CC = cycle-consistency.

| Model | % Positive (↑ better) | % Strong-Positive (↑ better) | Perplexity (↓ better) | E[%SP] (↑ better) |
|---|---|---|---|---|
| Gen-Disc (baseline) | 65.1 | 11.4 | 61.7 | 7.65 |
| MCMC-Random (baseline) | 22.9 | 0.3 | 20924 | 0.28 |
| MCMC-T5 (baseline) | 46.4 | 6.2 | 125 | 5.81 |
| GENhance w/o Smoothing & CC | 42.3 | 5.6 | 596 | 5.46 |
| GENhance w/o CC | 69.5 | 9.3 | 126 | 7.8 |
| GENhance | 87.7 | 21.4 | 118 | 19.52 |

Figure 3: GENhance generations have a higher proportion of Strong-Positive samples than the two baselines, and successfully extrapolate from the training data distribution. Shown are the top-1000 ranked generations for 200-Pos. Colored vertical lines show the mean value of each set of generations.
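As a sketch of how E[%SP] is computed, the bootstrap described above can be written as below, assuming hypothetical `rank_score` (the model's own discriminator score for a generation) and `is_strong_positive` (the ground-truth oracle's Strong-Positive decision) callables.

```python
import random


def expected_strong_positive(generations, rank_score, is_strong_positive,
                             n_rounds: int = 100, subsample: int = 1000, top_k: int = 100) -> float:
    """E[%SP]: average % Strong-Positive among the discriminator's top-k over bootstrap rounds."""
    percentages = []
    for _ in range(n_rounds):
        pool = random.sample(generations, subsample)              # (a) subsample the generations
        top = sorted(pool, key=rank_score, reverse=True)[:top_k]  # (b) keep the discriminator's top-k
        n_sp = sum(is_strong_positive(x) for x in top)            # (c) oracle-label the top-k
        percentages.append(100.0 * n_sp / top_k)
    return sum(percentages) / n_rounds                            # (d) average across rounds
```

The E[min] metric used for proteins in Section 4.2 follows the same recipe, replacing the per-round percentage with the minimum oracle ddG among the top-ranked candidates.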
Results: GENhance outperforms all baselines and ablation variants on all % positive metrics (Tables 1 and 2). In the 200-Pos setting, 49.7% of GENhance's generations fall in the more challenging Strong-Positive class, almost twice the fraction achieved by Gen-Disc, the strongest baseline. All models see performance drops on the % positive metrics in the No-Pos setup compared to the 200-Pos setup, except MCMC-Random, which is significantly worse in both cases. This reflects the greater challenge of generating desirable candidates when there are no positive samples in the training data. GENhance also outperforms the other baselines in the top-1000 and top-10000 pools of candidates (see Supplement A.5).

Figure 3 shows that both the baselines and GENhance can generate more positive sequences than the training data, with GENhance showing the largest distribution shift towards candidates with enhanced positiveness. GENhance generations also have lower perplexity values (i.e., better text quality) than the baseline and ablation methods, except for Gen-Disc, which explicitly models the training distribution (more in Supplement A.5). In fact, the average GENhance perplexity (118) is close to the perplexity of SST-5 test samples (101.3).

According to the human evaluation (Table 4), GENhance succeeds in generating positive, fluent text that outperforms the baselines. In the setup where the models were exposed to 200 positive training samples (200-Pos), GENhance outperformed all baselines, as well as the Neutral and Weak-Positive training samples, in the positiveness of its generations. In the setting where the models were not exposed to any positive samples (No-Pos), GENhance is comparable to the positive training samples in positiveness and outperforms all the other baselines. Likewise, the fluency of GENhance's generations either matches or exceeds that of the training samples and baselines.

Table 3: GENhance enhances neutral sequences from SST-5 to be strongly positive, as seen in these generated samples.

| Original Text (Attribute: Neutral) | Generated Text (Attribute: Strongly-Positive) |
|---|---|
| A melancholy, emotional film. | A melodramatic film, this is a powerful story. |
| An ambitious and moving but bleak film. | An ambitious and light-hearted film, it is a strong and moving story. |
| Some stunning visuals and some staggeringly boring cinema. | An engaging collection of fantastic visuals and yet at the same time stunningly striking. |
| You'll laugh for not quite an hour and a half, but come out feeling strangely unsatisfied. | You will laugh very well and laugh so much you will end up feeling great afterwards |
| A dark, dull thriller with a parting shot that misfires. | A dark and compelling thriller that ends with a bitter, compelling punch. |

Table 4: In a human evaluation of text generations on positiveness and fluency, GENhance outperforms baseline methods in both positiveness and fluency in most cases. Values are reported in % (↑ better for all metrics). Rows are grouped in pairs; each pair is one A/B comparison against GENhance.

| Comparison | Positiveness (200-Pos) | Fluency (200-Pos) | Positiveness (No-Pos) | Fluency (No-Pos) |
|---|---|---|---|---|
| Train Neutral | 7 | 30 | 18 | 39 |
| GENhance | 91 | 52 | 70 | 49 |
| Train Weak-Positive | 26 | 39 | 48 | 50 |
| GENhance | 63 | 39 | 45 | 34 |
| Gen-Disc | 29 | 42 | 35 | 44 |
| GENhance | 68 | 41 | 53 | 43 |
| MCMC-T5 | 17 | 18 | 34 | 46 |
| GENhance | 76 | 61 | 56 | 36 |

Ablation Study: Both the latent smoothing and cycle-consistency objectives contribute to generating sequences with improved attributes and text quality. Without the cycle-consistency objective, we observe a drop in performance across all metrics, indicating that this objective is vital in helping the latent space and encoder generalize to sequences outside the training distribution.
When the latent smoothing objective is removed, the generation quality drops, especially in the more challenging No-Pos setup, as indicated by the large increase in perplexity. This indicates that the smoothing objective is important for learning a latent space that is amenable to attribute-controlling perturbations while maintaining generation quality.

4.2 Experiments in Protein Design with ACE2 Proteins

Dataset: Designing a protein with an optimized property (e.g., stability) is of immense interest to synthetic biology and drug discovery. Here, we create a new synthetic dataset of stability for mutations of the human angiotensin-converting enzyme 2 (ACE2) protein. Since the SARS-CoV-2 virus binds to ACE2 to gain entry into human organs, ACE2 has emerged as a promising target for COVID-19 therapeutic protein design (Chan et al., 2020). Our optimization problem is to generate an ACE2-like protein sequence that is more stable than the samples in the training set. As a proxy for the experimentally measured stability of a protein sequence, we use the free-energy change computed by FoldX (Schymkowitz et al., 2005), which provides an automated, computational oracle for testing extrapolation methods in silico. In particular, we measure the change in free energy from wild-type, ddG (ΔΔG), between the folded and unfolded states of a protein sequence with the known ACE2 structure. A lower ddG value indicates a more stable sequence. We mutate the N-terminus subdomain of ACE2. The protein is represented as a sequence of 83 amino acids starting from the N-terminus side, with a vocabulary of 20 amino acids. We curate 250K ACE2 variants by mutating the wild-type (natural) ACE2 subdomain through substitutions and computing their ddG values. To keep a local landscape of deviation from the wild-type, amino acids are substituted by another amino acid with probability P = 4/L, where L is the length of the protein's mutable region. Mutants with more than eight mutations are discarded, and a constant region (NTNITEEN) is maintained. The ddG value of each sequence is computed as the average over five FoldX simulations. More details are in Supplement A.4. We use this dataset to evaluate GENhance's ability to generate protein sequences with lower ddG values than those found in the training distribution. In contrast to SST-5, the ACE2 ddG values lie on a continuous scale, allowing us to validate GENhance in the continuous-label setting.

Table 5: GENhance generates a large fraction of highly stable ACE2-like sequences, with better mean stability than the baselines. Metrics are computed for the top-100 ranked sequences. Smoothing = latent smoothing, CC = cycle-consistency.

| Model | ddG mean (↓ better) | PCI_y* (%) (↑ better) | E[min] (↓ better) |
|---|---|---|---|
| Gen-Disc (baseline) | -4.05 | 3 | -6.18 |
| MCMC (baseline) | -4.84 | 9 | -6.89 |
| DbAS (baseline) | -3.48 | 2 | -4.67 |
| CbAS (baseline) | -4.31 | 5 | -6.49 |
| GENhance w/o Smoothing & CC | -7.16 | 66 | -8.31 |
| GENhance w/o CC | -5.59 | 30 | -7.68 |
| GENhance | -7.34 | 77 | -8.71 |

Training: To initialize the weights of our models, we use a T5-base model pretrained on UniRef50 (Suzek et al., 2015) with a masked-span objective of mean span length 3. The discriminator models for both Gen-Disc and MCMC are trained by finetuning the pretrained encoder on the full set of 250K sequences, with a random 10% used as the validation set, while the generator modules of both Gen-Disc and GENhance are trained by finetuning the whole pretrained encoder-decoder model. Further details on training settings on four NVIDIA A100 GPUs are given in the Supplement.
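Before turning to evaluation, here is a sketch of the dataset mutagenesis described above: each position of the 83-residue mutable region is substituted with probability P = 4/L, the constant region NTNITEEN is preserved, and variants with more than eight mutations are discarded (the ddG labels of accepted variants are later computed with FoldX). The code is an illustrative reconstruction under those stated assumptions, not the exact curation script.

```python
import random
from typing import Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter amino-acid vocabulary
CONSTANT_REGION = "NTNITEEN"


def propose_variant(wild_type: str, max_mutations: int = 8) -> Optional[str]:
    """Return a substitution mutant of the wild-type subdomain, or None if it is rejected."""
    length = len(wild_type)
    p_sub = 4.0 / length                          # substitution probability P = 4/L
    const_start = wild_type.find(CONSTANT_REGION)

    residues = list(wild_type)
    n_mutations = 0
    for i in range(length):
        if const_start <= i < const_start + len(CONSTANT_REGION):
            continue                              # never mutate the constant region
        if random.random() < p_sub:
            residues[i] = random.choice(AMINO_ACIDS.replace(residues[i], ""))
            n_mutations += 1

    if n_mutations > max_mutations:
        return None                               # discard overly mutated variants
    return "".join(residues)
```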
Figure 4: GENhance-generated ACE2-like sequences show the largest shift in ddG distribution (i.e., stability improvement) from the training set. The top-100 ranked generated sequences are shown. ddG values are binned with an interval of 1. Triangles denote the mean value of each distribution.

Evaluation: For evaluation, we generate 250,000 sequences from each model, eliminating generations that lack the constant region (NTNITEEN) or whose length differs from the wild-type sequence. We then use each method's discriminator module to rank the candidates into pools of the top-10, top-100, and top-1000 sequences. The ddG values of the top-K sequences are then computed with the FoldX software, averaged over five simulation runs. The top-5% most stable sequences are used as the initial sequences for MCMC and as the input sequences for GENhance. Δz∥ perturbations with magnitude equal to 25% of the standard deviation of the training samples' z∥ are used for all GENhance models. Following Fannjiang & Listgarten (2020), we also measure the percent chance of improvement (PCI_y*) over the best label (most negative ddG) in the training data. To have a statistically relevant metric, we develop the expected minimum ddG value (E[min]), computed by the following steps: (a) randomly sample 10,000 of the 250,000 generations, (b) filter out the top-10 candidates based on the discriminator's ranking, (c) compute the ddG of these top-10 candidates with the FoldX oracle and record the minimum value among them, and (d) repeat steps (a) to (c) for 100 rounds and average the minimum ddG values across these rounds.

In addition to Gen-Disc and MCMC, we include CbAS (Brookes et al., 2019) and DbAS (Brookes & Listgarten, 2018) as baselines for this task. Both CbAS and DbAS use the same model architecture as the baseline Gen-Disc: we first sample generations from the baseline Gen-Disc's generator and then retrain the generator on those generations, re-weighted by the discriminator's scores. We used the initial hyperparameters from CbAS and conducted a grid search over (a) M, the number of generations per iteration (50, 100, 200), (b) Q, the percentile threshold (75, 90), and (c) the temperature of the sigmoid score-based weight computation (0.1, 1, 10), and report the best results for CbAS. The DbAS hyperparameter values mirror those of CbAS in our experiments.

Results: GENhance outperforms all baselines on all metrics (Table 5) in designing more stable sequences. GENhance sequences have the lowest mean ddG value, and a significant fraction of its generations are more stable than the most stable sequence in the training set, as indicated by the higher PCI_y* values. GENhance also has the lowest E[min] value, indicating that it may be well suited to finding stable protein candidates in laboratory experiments where only small numbers of candidates can be evaluated due to cost. Even though MCMC and CbAS fare better than the simpler Gen-Disc baseline, GENhance outperforms both on all three metrics. The distribution of samples generated by GENhance shows the largest shift towards more stable sequences relative to the original training distribution (Figure 4).

Ablation Study: As in the SST-5 experiments, GENhance outperforms its ablation variants on all metrics.
Somewhat surprisingly, we observe a drop in performance when the latent smoothing objective is added on its own, which we speculate is due to tension between GENhance's reconstruction objective and its encoder's contrastive objective. Adding the cycle-consistency objective gives a boost that lets GENhance outperform the vanilla variant, indicating that this objective helps stabilize the convergence of the two training objectives. To further study their contributions to GENhance's performance, we use GENhance's encoder to rank sequences generated by the generator of the baseline Gen-Disc setup and observe a boost in performance (Supplement Table 12). This suggests that GENhance's superior performance is due both to more accurate ranking by its encoder and to better generation by its decoder.

Discussion: Two main features of GENhance may contribute to its ability to generate attribute-enhanced sequences. First, while the baselines rely mainly on the discriminator's predictions to filter out promising candidates, GENhance can additionally use its latent space to steer the overall distribution of generated candidates towards a target region (e.g., more stable or more positive sequences). Second, unlike the baselines' discriminators, which are trained only on training samples, GENhance's encoder is also trained on GENhance-generated sequences through the cycle-consistency loss. This may contribute to GENhance's better ranking performance, increasing the fraction of desirable candidates.

5 Conclusion

We formalize the task of attribute-enhanced generation, which aims to create improved samples with target attributes beyond the training distribution. Scientific applications include the design of proteins, materials, and molecules without expensive, iterative discovery procedures. To this end, we propose GENhance, a generative model with a trained latent space, which generates sequences that outperform both the training data and baseline methods in natural language and protein engineering tasks. In the future, we aim to expand GENhance to data types beyond sequences and to study generation in scenarios where new data samples can be actively acquired. We also open-source our curated benchmark datasets with computational oracles, along with all models and evaluation metrics/scripts, to enable further research in extrapolation: https://github.com/salesforce/genhance.

Broader Impact: Extrapolation involves designing samples, whether text, proteins, molecules, or materials, with attributes that are unseen in training. If our technique or a future iteration of it is adopted broadly, care should be taken with the end use cases of the designed/optimized samples and their downstream effects, to ensure safe, non-nefarious, and ethical applications. For projects in any domain, active oversight during the project initiation, experimental optimization, and deployment phases should be put in place to ensure safe usage and to limit unintended harmful effects.

References

Namrata Anand and Po-Ssu Huang. Generative modeling for protein structures. Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7505-7516, 2018.
Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). International Conference on Machine Learning, pp. 224-232, 2017.
Claire N Bedbrook, Kevin K Yang, J Elliott Robinson, Elisha D Mackey, Viviana Gradinaru, and Frances H Arnold. Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics. Nature Methods, 16(11):1176-1184, 2019.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003.
Tristan Bepler and Bonnie Berger. Learning the protein language: Evolution, structure, and function. Cell Systems, 12(6):654-669, 2021.
Surojit Biswas, Grigory Khimulya, Ethan C Alley, Kevin M Esvelt, and George M Church. Low-N protein engineering with data-efficient deep learning. Nature Methods, 18(4):389-396, 2021.
David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning, pp. 773-782. PMLR, 2019.
David H Brookes and Jennifer Listgarten. Design by adaptive sampling. arXiv preprint arXiv:1810.03714, 2018.
Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019.
Alvin Chan, Yew-Soon Ong, Bill Pung, Aston Zhang, and Jie Fu. CoCon: A self-supervised approach for controlled text generation. ICLR, 2021.
Kui K Chan, Danielle Dorosky, Preeti Sharma, Shawn A Abbasi, John M Dye, David M Kranz, Andrew S Herbert, and Erik Procko. Engineering human ACE2 to optimize binding to the spike protein of SARS coronavirus 2. Science, 369(6508):1261-1265, 2020.
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.
Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268-9277, 2019.
Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero Dos Santos, Pin-Yu Chen, et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering, 5(6):613-623, 2021.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. ICLR, 2020.
Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Clara Fannjiang and Jennifer Listgarten. Autofocused oracles for model-based design. arXiv preprint arXiv:2006.08052, 2020.
Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633, 2017.
Salvador García and Francisco Herrera. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation, 17(3):275-306, 2009.
Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik.
Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268-276, 2018.
Alex Hawkins-Hooker, Florence Depardieu, Sebastien Baur, Guillaume Couairon, Arthur Chen, and David Bikard. Generating functional protein variants with variational autoencoders. PLoS Computational Biology, 17(2):e1008736, 2021.
Brian L Hie and Kevin K Yang. Adaptive machine learning for protein engineering. arXiv preprint arXiv:2106.05466, 2021.
Samuel Hoffman, Vijil Chenthamarakshan, Kahini Wadhawan, Pin-Yu Chen, and Payel Das. Optimizing molecules using efficient queries from property evaluations. arXiv preprint arXiv:2011.01921, 2020.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2017.
Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375-5384, 2016.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552, 2016.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020.
Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. arXiv preprint arXiv:1912.13464, 2019.
Ruizhe Li, Xiao Li, Chenghua Lin, Matthew Collinson, and Rui Mao. A stable variational autoencoder for text modelling. arXiv preprint arXiv:1911.05343, 2019.
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. ProGen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497, 2020.
Ali Madani, Ben Krause, Eric R Greene, Subu Subramanian, Benjamin P Mohr, James M Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z Sun, Richard Socher, et al. Deep neural language modeling enables functional protein generation across families. bioRxiv, 2021.
Christopher D Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Aria Mansouri Tehrani, Anton O Oliynyk, Marcus Parry, Zeshan Rizvi, Samantha Couper, Feng Lin, Lowell Miyagi, Taylor D Sparks, and Jakoah Brgoch. Machine learning directed search for ultraincompressible, superhard materials. Journal of the American Chemical Society, 140(31):9844-9853, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
Joost Schymkowitz, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau, and Luis Serrano.
The FoldX web server: An online force field. Nucleic Acids Research, 33(suppl_2):W382-W388, 2005.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. arXiv preprint arXiv:1705.09655, 2017.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, 2013.
Baris E Suzek, Yuqi Wang, Hongzhan Huang, Peter B McGarvey, Cathy H Wu, and UniProt Consortium. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926-932, 2015.
Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
Dinara R Usmanova, Natalya S Bogatyreva, Joan Ariño Bernad, Aleksandra A Eremina, Anastasiya A Gorshkova, German M Kanevskiy, Lyubov R Lonishin, Alexander V Meister, Alisa G Yakupova, Fyodor A Kondrashov, et al. Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation. Bioinformatics, 34(21):3653-3658, 2018.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
Zachary Wu, Kadina E Johnston, Frances H Arnold, and Kevin K Yang. Protein sequence design with deep generative models. Current Opinion in Chemical Biology, 65:18-27, 2021.
Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv preprint arXiv:2009.11848, 2020.
Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. arXiv preprint arXiv:2102.09554, 2021.
Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and Taylor Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. arXiv preprint arXiv:1805.11749, 2018.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.