# SALSA: Semantically-Aware Latent Space Autoencoder

Kathryn E. Kirchoff¹, Travis Maxfield², Alexander Tropsha²*, Shawn M. Gomez³,⁴

¹Department of Computer Science, UNC Chapel Hill
²Eshelman School of Pharmacy, UNC Chapel Hill
³Department of Pharmacology, UNC Chapel Hill
⁴Joint Department of Biomedical Engineering at UNC Chapel Hill and NC State University

kat@cs.unc.edu, tmaxfield@unc.edu, alex_tropsha@unc.edu, smgomez@unc.edu

*Corresponding author. Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In deep learning for drug discovery, molecular representations are often based on sequences known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset comprised of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity.

## Introduction

In drug discovery, learning the underlying semantics that govern molecular data presents an interesting challenge for deep learning. Effective learning of semantics is necessary for success in key tasks such as property prediction and de novo generation, and progress has been made in attempting to solve these tasks (Bilodeau et al. 2022). However, due to the ambiguous nature of molecular representations, models often fail to adequately capture the underlying semantics, resulting in a disorganized latent space. In the case of molecular data, semantics is often task-dependent but may amount to various emergent properties (e.g. structural, physicochemical, and biological properties) that are intrinsically linked to molecular structure, that is, the arrangement of constituent atoms and bonds (Honda et al. 2016).

Figure 1: Given molecule (A) we consider three molecules whose graphs are structurally similar, being a single graph edit from (A). The naive autoencoder maps these similar molecules to latent codes of various proximity: (1) is mapped close to (A), while (3) is mapped far from (A). In contrast, our proposed autoencoder, SALSA, learns a semantically-aware space such that structurally similar molecules are collectively mapped to nearby codes.
Molecular structure can be captured in the form of a graph, and thus the semantics that govern chemical manifolds may be specified by the graph-to-graph similarities (i.e. structural similarities) between molecules. In this way, graph edit distance (GED) defines a semantically meaningful unit of change between molecular entities.

In molecular representation learning, we can conveniently express molecular structures as linear sequences, known as simplified molecular-input line-entry system (SMILES) strings (Weininger 1988), in order to take advantage of recent progress in sequence-to-sequence modeling. Borrowing from advancements made in natural language processing (NLP), many autoencoder-based methods operating on SMILES sequences have been proposed, as they provide a promising framework for solving problems in drug discovery (Alperstein, Cherkasov, and Rolfe 2019; Gómez-Bombarelli et al. 2018; Bilodeau et al. 2022). However, these SMILES-based autoencoders are plagued by some of the same challenges met in the field of NLP, namely, difficulties in learning latent spaces that capture underlying sentence semantics (Xu et al. 2021; Shen et al. 2020). This arises from the fact that for discrete objects such as sentences, autoencoders have the capacity to map similar data to distant latent representations. We observe an analogous problem in that SMILES-based autoencoders are not able to adequately learn the structure-based semantics that underlie chemical datasets, and as a result, these models may map semantically similar molecules to distant codes in the latent space. This phenomenon is more precisely defined as an instance in which structurally similar molecules (low GED) are mapped to distant latent representations (high Euclidean distance). We show an example of this in Figure 1. Collectively, many of these semantically naive events induce a disorganized latent space, which limits the success of these models in downstream tasks.

To remedy this shortcoming of SMILES-based autoencoders, we propose enforcing a sense of semantic awareness onto an autoencoder such that structurally similar molecules are mapped near one another in the latent space. Our proposed model, the Semantically-Aware Latent Space Autoencoder (SALSA), is a modified SMILES-based transformer autoencoder that, in addition to a canonical reconstruction loss, learns a contrastive task with the objective of mapping structurally similar molecules, whose graphs are separated by a single edit distance, to similar codes in the effected latent space. In this way, we are able to learn a semantically meaningful latent space. We compare SALSA to its two ablations (a naive SMILES autoencoder and a contrastive encoder) and evaluate their latent spaces in terms of not only structural awareness, but also physicochemical and biological awareness as well as semantic continuity. We are the first, to our knowledge, to enforce structural awareness onto a SMILES-based model.

Our contributions are as follows:

- We propose a novel modeling framework, SALSA, that composes a transformer autoencoder with a contrastive task to achieve semantically-aware molecular representations.
- We develop a scheme for constructing a chemical dataset suited to contrastive learning of molecular entities, specifically aimed at learning structural similarities between molecules.
- We evaluate the quality of SALSA's latent space based on: 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity.

## Related Works

**Sequence-Based Models.** For our sequence-based (i.e. SMILES-based) representation, we are specifically interested in methods that allow for global representation of sequence inputs. Earlier methods aimed at embedding whole sequences utilized recurrent neural networks (RNNs), including long short-term memory networks (LSTMs), which are naturally aligned to this objective (Bowman et al. 2016; Shen et al. 2020). However, most state-of-the-art methods are based on the original transformer architecture (Vaswani et al. 2017) and do not provide a global representation of the input. Recently, authors have modified the transformer architecture to include a bottleneck (or pooling) layer allowing for a single, fixed-size global embedding of the input (Montero, Pappas, and Smith 2021; Jiang et al. 2020; Li et al. 2020). Examples of autoencoder-based methods include ChemVAE (Gómez-Bombarelli et al. 2018) and All SMILES VAE (Alperstein, Cherkasov, and Rolfe 2019). Transformer-based models include ChemBERTa (Chithrananda, Grand, and Ramsundar 2020), SMILES Transformer (Honda, Shi, and Ueda 2019), and FragNet (Shrivastava and Kell 2021). Less common, however, is the composed architecture of a transformer autoencoder.

**Contrastive Learning.** For molecular data, both SMILES and graph representations have been explored in the context of contrastive learning. The FragNet model proposed by Shrivastava and Kell (2021) utilized the normalized temperature-scaled cross entropy (NT-Xent) loss (Sohn 2016) to map enumerated SMILES of identical molecules nearby in the latent space. For graph representations, Wang et al. (2021) similarly used the NT-Xent loss to maximize the agreement between pairs of augmented graphs ("views") describing the same molecule; here, each view (i.e. positive sample) is obtained by masking out nodes or edges. The NT-Xent loss, although widely successful, operates solely on positive pairs, an issue addressed by Khosla et al. (2020) in their formulation of the Supervised Contrastive (SupCon) loss, which allows for comparison among an arbitrarily sized set (rather than a pair) of positive instances.

## Methodology

### Overview of Approach

Broadly, our goal is to impart semantic awareness onto a SMILES-based transformer autoencoder such that the effected latent representations better respect the structural similarities, particularly the graph edit distance (GED), between molecular pairs. We do this by incorporating a contrastive component into the architecture that differentiates similar and dissimilar molecular graphs.

**Contrastive Objective.** Our contrastive task necessitates known pairs of similar and dissimilar molecules. We opt to consider as similar any two molecules separated by a single graph edit. Recall that the graph edit distance (GED) between two molecules, viewed as labeled graphs, is the minimum number of single edits required to make one graph isomorphic to the other. It is computationally infeasible to systematically obtain all pairs of single-GED molecules from an existing dataset. To sidestep this issue, we generate a bespoke dataset of 1-GED molecular pairs. We accomplish this by defining a set of node-level transformations, or mutations, which are applied to anchor molecules to obtain similar (1-GED) molecules, which we will refer to as "mutants".
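To make the 1-GED notion concrete, the following is a minimal sketch (our illustration, not the authors' tooling) that builds atom-labeled graphs from SMILES with RDKit and NetworkX and confirms that a single node substitution corresponds to a graph edit distance of one; the example molecules and the helper name `mol_to_graph` are illustrative assumptions.

```python
# Illustrative sketch: a single node substitution yields GED = 1.
import networkx as nx
from rdkit import Chem

def mol_to_graph(smiles: str) -> nx.Graph:
    """Convert a SMILES string into an atom-labeled NetworkX graph."""
    mol = Chem.MolFromSmiles(smiles)
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), symbol=atom.GetSymbol())
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())
    return g

anchor = mol_to_graph("CCO")  # ethanol
mutant = mol_to_graph("CCN")  # the O node substituted by N

# Substitutions are free only when atom labels match, so the single
# relabeling O -> N costs exactly one edit.
ged = nx.graph_edit_distance(
    anchor, mutant, node_match=lambda a, b: a["symbol"] == b["symbol"]
)
print(ged)  # 1.0
```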
**Autoencoder Component.** We specify our SMILES-based autoencoder with a transformer encoder and decoder, and introduce an intermediate bottleneck in order to obtain fixed-length vector representations. Combined with the contrastive component, the general framework is encapsulated in Figure 2.

Figure 2: Overview of the SALSA architecture. SALSA operates on multi-mutant batches, but here we show a single (positive) anchor–mutant pair for simplicity. The reconstruction loss ($\mathcal{L}_r$) is computed on the output sequence probabilities. In the case of a positive pair (similar molecules) as shown, the contrastive loss ($\mathcal{L}_c$) aims to push their normalized representations close together in the latent space. Note that weights between the two networks are shared; thus, only a single model is trained and used for inference.

We note that an encoder trained solely on the contrastive objective, that is, without the reconstruction loss central to an autoencoder, may learn a degenerate mapping in which our designated similar molecules are mapped to representations that are in fact too similar, being almost stacked on top of one another. In this way, the reconstruction loss provided by the autoencoder component acts as a regularizer (on the contrastive loss) that encourages similar molecules to be mapped to distinct codes.

### Training Dataset

**Anchor Compounds.** We utilize the dataset developed by Popova, Isayev, and Tropsha (2018), which contains approximately 1.5 million SMILES sequences sourced from the ChEMBL database (version ChEMBL21), a chemical database comprised of drug-like or otherwise biologically relevant molecular compounds (Bento et al. 2014). After procuring the full dataset, the set of compounds was run through a standard curation pipeline; for an in-depth description of the curation process, please refer to Popova, Isayev, and Tropsha (2018). We further filter the dataset by SMILES length, allowing only molecules with SMILES of length less than or equal to 110 characters, leaving 1,256,277 compounds. These compounds constitute the full set of anchors from which we generate 1-GED mutant compounds, as explained in the following section.

**Generating Mutant Compounds.** We define a molecular graph generally as $g = (V, E)$, where $V = \{v_0, \ldots, v_A\}$ is the set of nodes, with each $v_a \in \{\text{C, O, N, S, Br, Cl, I, F, P, B, \$}\}$ (atom types), and $E = \{(v_a, v_b) \mid v_a, v_b \in V\}$ is the set of edges (bonds). Note that the atom type "$" is a stand-in for any atom type not in the remaining list, analogous to an "unk" character in natural language models. Here, we differentiate anchors from mutants with a tilde, i.e. anchor graphs as $g$ and mutant graphs as $\tilde{g}$. Given an anchor, we consider its graph, $g_i \in G$, where $G$ is the anchor set (sourced from ChEMBL) and $i$ is the index identifying the anchor in $G$. We obtain a mutated graph, or mutant, by randomly sampling a mutation operator $t(\cdot) \in T$ and applying that mutation to the anchor, $t(g_i) = \tilde{g}_i^{(j)}$, where $i$ again identifies the original anchor and $j$ is the index of the mutant graph within the anchor's positive sample set. Our set of mutation (graph transformation) operators, $T = \{\text{Add}, \text{Replace}, \text{Remove}\}$, is rationally defined to avoid transformations that would drastically alter graph topology, i.e. separating a molecule into disconnected graphs or breaking and forming rings. Furthermore, we require mutants to be chemically valid molecular graphs, and we normalize all SMILES using the RDKit canonicalization algorithm (RDKit 2023).
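As a rough illustration of this procedure (a sketch under our own assumptions, not the authors' released pipeline), the snippet below applies one Replace-style edit with RDKit, enforces chemical validity via sanitization, and canonicalizes the resulting SMILES. The helper name, the uniform atom-type sampling, and the example molecule are illustrative; the paper instead draws incoming atom types from the ChEMBL atom-type distribution.

```python
# Hypothetical sketch of one "Replace" mutation with validity check + canonicalization.
import random
from rdkit import Chem

def replace_mutation(anchor_smiles: str, atom_types=("C", "N", "O", "S", "F", "Cl")):
    """Substitute the atom type of one randomly chosen heavy atom and return the
    canonical SMILES of the mutant, or None if the edit is chemically invalid."""
    mol = Chem.MolFromSmiles(anchor_smiles)
    rw = Chem.RWMol(mol)                      # editable copy of the molecular graph
    idx = random.randrange(rw.GetNumAtoms())  # node to mutate
    new_symbol = random.choice(atom_types)    # NOTE: paper samples from the ChEMBL atom-type distribution
    rw.GetAtomWithIdx(idx).SetAtomicNum(
        Chem.GetPeriodicTable().GetAtomicNumber(new_symbol)
    )
    try:
        Chem.SanitizeMol(rw)                  # rejects chemically invalid mutants (e.g. bad valence)
    except Exception:
        return None
    # May coincide with the anchor if the sampled type equals the original one;
    # a full pipeline would discard such non-edits and duplicates.
    return Chem.MolToSmiles(rw, canonical=True)

print(replace_mutation("CC(Cc1ccc(cc1)C(C(=O)O)C)C"))  # ibuprofen anchor -> 1-GED mutant or None
```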
Given these specifications, our mutation operators are defined as follows:

- Node addition (Add): Append a new node, and a corresponding edge, to an existing node in the graph.
- Node substitution (Replace): Change the atom type of an existing node in the graph.
- Node deletion (Remove): Remove a singly-attached node and its corresponding edge from the graph.

For both Add and Replace, incoming atom types are drawn from the observed atom type distribution in the original ChEMBL dataset. For each anchor, $g_i$, we generate 10 distinct mutants that constitute the positive sample set, $P(i)$, for that anchor:

$$P(i) = \{\tilde{g}_i^{(1)}, \tilde{g}_i^{(2)}, \ldots, \tilde{g}_i^{(10)}\} \subset \tilde{G} \tag{1}$$

Our final training set is made up of the entire set of anchors and their respective mutants, amounting to 13,819,047 total training compounds. We show an example of a batch composed of three anchors, with five mutants per anchor, in Figure 3.

Figure 3: An example batch from the mutated dataset composed of three anchors, $g_1$, $g_2$, $g_3$, and their respective sets of mutants (positive samples), $P(i) = \{\tilde{g}_i^{(1)}, \tilde{g}_i^{(2)}, \tilde{g}_i^{(3)}, \tilde{g}_i^{(4)}\}$. Negative samples are defined between anchors and all other molecules in the batch not in that anchor's set $P(i)$. Colored atoms of mutant compounds correspond to single graph edits from anchor to mutant: Add (green), Replace (blue), and Remove (red).

**Faulty-Positive Filtering.** Although our mutation operators ensure chemical validity, they do not ensure physicochemical proximity of mutants to anchors. Due to the complex quantum mechanics underlying molecular interactions, a single graph-edit mutation may effect large differences in the physicochemical properties of anchor and mutant. We circumvent such phenomena by filtering out mutants that are too dissimilar from their respective anchor, based on the Mahalanobis distance between the physicochemical properties of an anchor and those of its mutants. The Mahalanobis distance between an anchor $g_i$ and a mutant $\tilde{g}_i^{(j)}$ is defined as:

$$d_M\!\left(g_i, \tilde{g}_i^{(j)}\right) = \sqrt{\left(x_i - \tilde{x}_i^{(j)}\right)^{T} \Sigma^{-1} \left(x_i - \tilde{x}_i^{(j)}\right)} \tag{2}$$

where $x_i$ and $\tilde{x}_i^{(j)}$ are the physicochemical property vectors for $g_i$ and $\tilde{g}_i^{(j)}$, respectively. The covariance matrix, $\Sigma$, corresponds to the distribution of physicochemical properties computed over the initial anchor set $G$. We computed physicochemical properties corresponding to the standard collection of RDKit descriptors, and then filtered out descriptors having any invalid property values in order to obtain a real-valued property vector for each molecule.

### Modeling Framework

The core architecture of SALSA is based on the encoder-decoder transformer paradigm proposed by Vaswani et al. (2017), with an additional autoencoder specification. The SALSA transformer takes SMILES sequences as input and additionally considers the similarity relationships between those SMILES inputs (denoted either "similar" or "dissimilar"), as determined by the structural similarity of their corresponding molecular graphs.

**SMILES Input.** While the original transformer operated on natural language sequences, SALSA operates on SMILES sequences corresponding to molecular graphs. A SMILES (simplified molecular-input line-entry system) sequence is an ordered list of the atom and bond types encountered during a depth-first traversal of a spanning tree of the associated molecular graph (e.g. the SMILES sequence of ibuprofen is "CC(Cc1ccc(cc1)C(C(=O)O)C)C").
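To illustrate the traversal view of SMILES, here is a small sketch (our own example, not from the paper): re-rooting the depth-first traversal yields a different string for the same ibuprofen graph, and both strings canonicalize to the same SMILES. The atom index used for re-rooting is an arbitrary choice.

```python
# Sketch: two SMILES traversals of the same ibuprofen graph.
from rdkit import Chem

ibuprofen = Chem.MolFromSmiles("CC(Cc1ccc(cc1)C(C(=O)O)C)C")

s_default = Chem.MolToSmiles(ibuprofen)                   # RDKit's canonical traversal
s_rerooted = Chem.MolToSmiles(ibuprofen, rootedAtAtom=5)  # start the DFS at an aromatic carbon

print(s_default)
print(s_rerooted)

# Different strings, same molecular graph: re-parsing and canonicalizing
# the re-rooted SMILES recovers the canonical form.
assert Chem.MolToSmiles(Chem.MolFromSmiles(s_rerooted)) == s_default
```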
We adopt a simple tokenization strategy yielding a vocabulary of 39 tokens, including the most common atom and bond types present in drug-like organic molecules, in addition to a start token, "<", an end token, ">", a pad token, "X", and an unknown token, "$", used in cases where SALSA encounters an atom type not present in the provided vocabulary.

**SALSA Architecture.** We modify the original transformer architecture into an autoencoder aiming to reproduce the original input. This is accomplished by introducing a pooling layer and a subsequent upsampling layer between the encoder and decoder, in this way imposing an autoencoder bottleneck that produces fixed-size latent representations. Specifically, whereas the intermediate output of the original transformer encoder is a vector of size $\mathbb{R}^{L \times H}$ for a sequence of length $L$ and hidden dimension size $H$, SALSA's encoder is designed to output a latent vector of fixed size $\mathbb{R}^{S}$. This is accomplished by first applying a component-wise mean pooling from $\mathbb{R}^{L \times H} \to \mathbb{R}^{H}$ before projecting $\mathbb{R}^{H} \to \mathbb{R}^{S}$. The SALSA latent vector is constrained to live on the unit hypersphere embedded in $\mathbb{R}^{S}$, so we normalize the output of the pooling layer. It is the output of the pooling layer, $z \in \mathbb{R}^{S}$, that is input into the contrastive loss, explained in the next section. Then, as the transformer decoder is designed to accept an input of size equal to the output of the encoder, i.e. $\mathbb{R}^{L \times H}$, we first pass the latent vector through a linear layer with the appropriate output dimension and reshape as needed before passing it to the decoder. This is referred to as "Upsample" in Figure 2. Note that this method of injecting the latent vector into the transformer decoder resembles the method called "memory" in Li et al. (2020), where it was demonstrated to yield superior results over an alternative strategy.

**Loss Function.** We define a compound loss function composed of: (1) a contrastive objective defined over a batch of inputs, and (2) a reconstruction task, characteristic of sequence autoencoders. For our contrastive objective, we adapt the supervised contrastive (SupCon) loss (Khosla et al. 2020). The SupCon loss allows for multiple positive comparisons per anchor, resulting in improved performance relative to naive contrastive losses, which operate on the assumption of only a single positive sample per anchor. The SupCon loss is defined as

$$\mathcal{L}_c = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(z_i \cdot z_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)}, \tag{3}$$

where $I$ is the set of anchors in a batch, $A(i)$ is the set of all samples sharing a batch with anchor $i$ (which has latent code $z_i$), and $P(i)$ are those elements of $A(i)$ that are similar to $i$, using the terminology introduced above.

The autoencoder, operating on SMILES, is trained with a reconstruction loss with causal masking. For a single SMILES sequence $s_i$ and its associated latent vector $z_i$, the loss is defined as:

$$\mathcal{L}_r(s_i) = -\sum_{t=1}^{|s_i|} \log p_\theta\!\left(s_i^{(t)} \mid z_i, s_i^{(<t)}\right)$$
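To ground the bottleneck and the contrastive objective described above, the following is a minimal PyTorch sketch written under our own assumptions rather than taken from the SALSA release: a mean-pool/project/normalize bottleneck producing unit-norm codes, and a SupCon-style loss in which all samples sharing a group id (an anchor and its mutants) act as positives. Module, function, and argument names are illustrative.

```python
# Hypothetical sketch of the pooling bottleneck and a SupCon-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingBottleneck(nn.Module):
    """Mean-pool encoder outputs (L x H), project to S, and L2-normalize so the
    latent codes lie on the unit hypersphere."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        pooled = enc_out.mean(dim=1)        # (batch, L, H) -> (batch, H)
        z = self.proj(pooled)               # (batch, S)
        return F.normalize(z, dim=-1)       # unit-norm latent codes

def supcon_loss(z: torch.Tensor, group_ids: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """SupCon-style loss: samples sharing a group id (anchor plus its mutants)
    are positives; every other sample in the batch is a negative. Here every
    sample is treated as an anchor, as in Khosla et al. (2020)."""
    n = z.size(0)
    sim = z @ z.t() / tau                                      # cosine similarities / temperature
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (group_ids[:, None] == group_ids[None, :]) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))            # denominator runs over A(i), i.e. all but i
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)              # |P(i)|
    loss_per_sample = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss_per_sample.mean()

# Toy usage: pool a fake batch of encoder outputs (batch=4, L=12, H=64) into
# 32-dim codes, grouped as two (anchor + mutant) pairs.
enc_out = torch.randn(4, 12, 64)
z = PoolingBottleneck(hidden_dim=64, latent_dim=32)(enc_out)
groups = torch.tensor([0, 0, 1, 1])
print(supcon_loss(z, groups))
```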