# Learning Chemical Rules of Retrosynthesis with Pre-training

Yinjie Jiang¹, Ying Wei²\*, Fei Wu¹³, Zhengxing Huang¹, Kun Kuang¹, Zhihua Wang³
¹ Zhejiang University  ² City University of Hong Kong  ³ Shanghai Institute for Advanced Study of Zhejiang University
{jiangyinjie, wufei, zhengxinghuang, kunkuang, zhihua.wang}@zju.edu.cn, yingwei@cityu.edu.hk
\*Corresponding authors.

## Abstract

Retrosynthesis aided by artificial intelligence has been a very active and burgeoning area of research, owing to its critical role in drug discovery as well as materials science. Three categories of solutions, i.e., template-based, template-free, and semi-template methods, constitute the mainstream approaches to this problem. In this paper, we focus on template-free methods, which are known to be less bothered by the template generalization issue and the atom-mapping challenge. Among the remaining problems of template-free methods, failure to conform to chemical rules is the most pronounced. To address this issue, we seek a pre-training solution that encodes chemical rules into the pre-trained model. Concretely, we enforce the atom conservation rule via a molecule reconstruction pre-training task, and the reaction rule that dictates reaction centers via a reaction-type-guided contrastive pre-training task. In our empirical evaluation, the proposed pre-training solution substantially improves single-step retrosynthesis accuracies on three downstream datasets.

## Introduction

First formulated by Corey and Wipke (1969), retrosynthesis is the task of designing synthesis routes for target products, and it plays a significant role in chemistry and pharmacy. With the ever-increasing number of chemical reactions, even experienced and professional chemists spend much time on retrosynthesis, so automated retrosynthesis tools are urgently needed. Single-step retrosynthesis prediction, which serves as the foundation of retrosynthesis route planning, has therefore become a critical juncture at the intersection of machine learning and chemistry.

Existing machine learning works on retrosynthesis prediction fall mainly into three categories: template-based methods (Segler and Waller 2017; Dai et al. 2019; Sun et al. 2021; Seidl et al. 2021), semi-template-based methods (Yan et al. 2020; Shi et al. 2020; Somnath et al. 2021; Seo et al. 2021), and template-free methods (Schwaller et al. 2020; Zhu et al. 2021; Sun et al. 2021). Template-based and semi-template-based methods heavily depend on templates or atom-atom mappings, which are themselves unsolved challenges in chemistry. Meanwhile, the performance of template-free methods is barely satisfactory because of the lack of additional chemical information.

After taking a closer look at the results of previous models, we pinpoint three challenges in retrosynthesis prediction, especially for template-free methods (see Figure 1): 1) generated molecules are invalid; 2) generated reactants break the Law of Conservation of Atoms; 3) generated reactants do not react, or do not produce the target product. To address these three challenges, we propose a Pre-trained Model for Single-step Retrosynthesis (PMSR) with three pre-training tasks. Besides auto-regression, we propose a molecule recovery task with regional masks for the first two challenges.
The masked elements are recovered from the surrounding visible elements, which helps the model generate valid molecules. These masked elements are also expected to be predictable from the given product, encouraging the model to follow the conservation of atoms. Additionally, it is widely accepted that the reaction type, as prior knowledge, greatly improves the performance of retrosynthesis. Thus, we propose a supervised contrastive task in PMSR to force the model to focus more on reaction centers.

Our main contributions can be summarized as follows.

- We summarize three challenges of single-step retrosynthesis prediction and propose three solutions to these challenges, i.e., masked element recovery, masked fragment recovery, and reaction classification.
- We design three pre-training tasks customized to retrosynthesis, including auto-regression, molecule recovery, and contrastive reaction classification. The three pre-training tasks address the three challenges and improve the performance of retrosynthesis.
- We also introduce the pointer-generator architecture and data augmentation in PMSR, both of which further benefit retrosynthesis.
- After fine-tuning on USPTO-50K (Schneider, Stiefl, and Landrum 2016), USPTO-FULL (Dai et al. 2019), and Pistachio (Mayfield, Lowe, and Sayle 2017), our model surpasses previous methods by a large margin. We also conduct experiments to generate all precursors, including reactants and reagents, on USPTO-50K (Schneider, Stiefl, and Landrum 2016) and USPTO-MIT (Jin et al. 2017). PMSR again achieves satisfactory results, which demonstrates the power of our pre-training tasks.

## Related Work

### Single-Step Retrosynthesis Prediction

**Template-Based Methods.** Template-based retrosynthesis prediction aims to prioritize reaction templates in different ways. RetroSim (Coley et al. 2017) selects templates by molecular similarity. NeuralSym (Segler and Waller 2017) casts template selection as a classification task. GLN (Dai et al. 2019) maximizes the conditional joint probability of templates and reactants using learned graph embeddings. Additionally, DualTB (Sun et al. 2021) introduces an energy-based model on top of GLN. MHN (Seidl et al. 2021) uses modern Hopfield networks to associate molecules with templates, which improves the performance of template relevance prediction. LocalRetro (Chen and Jung 2021) extracts more general templates using only local information. All these methods suffer from the low generalization of templates as well as their huge and ever-growing number.

**Semi-Template-Based Methods.** Semi-template-based methods resort to the reaction centers identified by atom-atom mappings (AAM), in which atoms in a product are mapped to those in the corresponding reactants. RetroXpert (Yan et al. 2020), G2Gs (Shi et al. 2020), and GraphRetro (Somnath et al. 2021) first predict reaction centers to generate synthons and then complete the synthons into reactants. MEGAN (Sacha et al. 2021) modifies the product step by step to generate reactants with a graph-to-sequence model. GTA (Seo et al. 2021) generates reactants via a Transformer trained with a cross-attention MSE loss calculated from AAM. All the above methods rely on the correctness of AAM, yet automated AAM remains an open problem in chemistry (Jaworski et al. 2019; Schwaller et al. 2021a).
**Template-Free Methods.** Template-free methods train sequence-to-sequence models to generate SMILES strings of reactants directly. MT (Schwaller et al. 2020, 2019) first applied the Transformer to retrosynthesis prediction. SCROP (Zheng et al. 2019) designs a syntax corrector to improve the correctness of generated strings. DualTF (Sun et al. 2021) formulates retrosynthesis with energy-based models and adds an additional forward-prediction loss. Our method aims to overcome the particular challenges of template-free methods summarized in the Introduction by introducing more chemical information into the model.

### Chemical Pre-training

The Transformer (Vaswani et al. 2017) is widely used in NLP and has achieved tremendous success combined with the pre-training paradigm. The SMILES representation of chemical molecules opens the door to pre-training Transformer-based molecular models. ChemBERTa (Chithrananda, Grand, and Ramsundar 2020) transfers RoBERTa (Liu et al. 2019) to the chemical domain. X-MOL (Xue et al. 2021), SMILES-BERT (Wang et al. 2019), and MolBERT (Fabian et al. 2020) introduce chemical features into Transformer-based pre-trained models. DMP (Zhu et al. 2021) pre-trains a Transformer-based model together with a graph-based model. However, these models only work with molecules instead of reactions, so they cannot learn the chemical rules embodied in reactions. Rxnfp (Schwaller et al. 2021b) attempts to map the space of chemical reactions with a Transformer-based pre-trained encoder. T5Chem (Lu and Zhang 2022) is the most related work, which directly adapts the T5 framework and fine-tunes on multiple tasks. MolR (Wang et al. 2022) pre-trains a GNN encoder for molecules with reactions. Different from the above works, our work focuses on reaction-level tasks, including retrosynthesis, with reactions as training data; it is the first to pre-train a sequence-to-sequence model for reactions with chemically targeted pre-training tasks.

## Single-Step Retrosynthesis

### Sequence-to-Sequence Based Single-Step Retrosynthesis Prediction

We use the Simplified Molecular Input Line Entry System (SMILES) (Weininger 1988), which represents a three-dimensional molecular structure as a one-dimensional string. As a result, a chemical reaction is represented as two strings: a precursor string and a product string. We denote a chemical reaction as $(x, y) \in (\mathcal{X}, \mathcal{Y})$, where $x = (x_1, x_2, \dots, x_m)$ is the target product represented by an $m$-token SMILES string and $y = (y_1, y_2, \dots, y_n)$ denotes the precursors of the reaction with $n$ tokens. In retrosynthesis prediction, the products $\mathcal{X}$ form the source domain and the precursors $\mathcal{Y}$ form the target domain. The objective function of a retrosynthesis prediction model is

$$\mathcal{L}(\theta; (\mathcal{X}, \mathcal{Y})) = \sum_{(x, y) \in (\mathcal{X}, \mathcal{Y})} \log P(y \mid x; \theta). \quad (1)$$

More specifically, retrosynthesis prediction is better viewed as conditional generation than as translation, because $\mathcal{X}$ and $\mathcal{Y}$ share the same vocabulary and all atoms in the product appear in the precursors of a reaction.

### Challenges in Single-Step Retrosynthesis

We evaluate several single-step retrosynthesis models and thereupon pinpoint three major types of errors in single-step retrosynthesis, as shown in Figure 1.

**Basic Chemical Rules of Molecules.** All molecules are expected to obey basic valence bond theory. For example, a fluorine atom should never connect to three other atoms, as in the invalid molecule shown in Figure 1(a). For a generative problem, it is necessary to learn such chemical rules so that generated molecules are valid. However, these chemical rules are implicit, and models do not produce valid molecules exactly according to them.
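As a concrete illustration of the validity requirement, the minimal sketch below (not part of the original paper) uses RDKit to check whether a generated SMILES string parses into a chemically valid molecule; `Chem.MolFromSmiles` returns `None` when the string violates syntax or valence rules. The example strings are hypothetical model outputs.

```python
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if the SMILES string parses into a chemically valid molecule.

    RDKit's parser performs sanitization (including valence checks), so an atom
    with an impossible valence, e.g. a fluorine bonded to three neighbors,
    makes MolFromSmiles return None.
    """
    return Chem.MolFromSmiles(smiles) is not None

# Hypothetical model outputs: a valid aspirin SMILES and an invalid string
# in which fluorine carries more bonds than its valence allows.
print(is_valid_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # True
print(is_valid_smiles("FF(C)C"))                 # False: F cannot have three neighbors
```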
**Conservation in Reactions.** We observe that many incorrect results generated by single-step retrosynthesis models, especially template-free ones, do not obey the Law of Conservation of Atoms. That is, parts of the reactants other than the reaction center change during the reaction. For example, in Figure 1(b), the methyl group is mistakenly attached to the ortho position of the carboxyl group, which is completely different from the product.

[Figure 1: Three kinds of errors in retrosynthesis and their solutions — (a) a wrong molecule; (b) a wrong conservation of atoms; (c) a wrong reaction center; (d) masked element recovery; (e) masked fragment recovery; (f) reaction classification.]

**Selectivity of Reaction Centers.** The most challenging problem lies in selecting the correct reaction center, which is a small region of the product. Given a target product, there usually exist several candidate reaction centers, but some of them are infeasible because they violate the mechanism of chemical reactions. On the benzene ring in Figure 1(c), a methoxy group cannot be attached to the para position of the acetyl group. Therefore, the model also needs to learn such chemical knowledge to avoid this kind of mistake.

These challenges in single-step retrosynthesis prediction all concern chemical rules. In order to learn generalized and correct chemical rules, pre-training on a large dataset with chemically informed pre-training tasks serves as a viable solution. To this end, we propose a pre-trained model for single-step retrosynthesis.

### Reagent Generation

Only Schwaller et al. (2020) have attempted the concurrent prediction of reactants and reagents (e.g., solvents and catalysts), even though reagents are significant conditions of reactions. Both theoretically and practically, reagent prediction is necessary for reaction validation and automated experiments. Meanwhile, it is more difficult to generate all precursors, including reactants and reagents, because of the variety of reagents.

## Methodology

### Solutions to the Challenges of Retrosynthesis

**Generation of Valid Molecules.** A model that generates a wrong atom or a wrong bond at a certain position of a molecule is undesirable. Thus, we design a masked element recovery task: given a molecule, we mask some of its atoms and bonds and let the model recover the masked elements. This task helps the model learn the basic rules of molecules and avoid illegal atoms or bonds. For example, as shown in Figure 1(d), the model would not recover the masked element with a fluorine atom. In addition, more training data also improves the quality of generated molecules.

**Conservation of Products and Precursors.** In retrosynthesis prediction, most violations of the conservation law occur when functional groups in the product are dislocated in the generated precursors. Therefore, we design a masked fragment recovery task, in which the model recovers masked consecutive elements of the precursors given the corresponding product. Each masked segment contains several atoms and even functional groups, and the model needs to place them in the correct order according to the given product. Different from the masked element recovery task, the masked fragment recovery task encourages the model to keep the precursors consistent with the product. For example, while the methyl group can attach to either the ortho or the meta position of the carboxyl group in Figure 1(e), only the meta position is reasonable according to the product. Besides, for rare atoms, we propose to use the pointer-generator (See, Liu, and Manning 2017; Nishida et al. 2019), which offers the opportunity to copy an element directly from the product. The copying mechanism helps to generate rare atoms that appear in the product rather than miss them.
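The two recovery tasks just described can be pictured with a small sketch. The snippet below is an illustrative example rather than the authors' implementation: it tokenizes a SMILES string with a regular expression commonly used for SMILES tokenization in molecular Transformers, then masks either random individual atom/bond tokens (element recovery) or one contiguous span of tokens (fragment recovery). The `<mask>` token, the masking ratio, and the span length are hypothetical choices.

```python
import random
import re

# A SMILES tokenization pattern commonly used for molecular Transformers:
# each match is one atom, bond, ring-closure, or structural token.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
MASK = "<mask>"  # hypothetical mask token


def tokenize(smiles: str) -> list[str]:
    return SMILES_REGEX.findall(smiles)


def mask_elements(tokens: list[str], ratio: float = 0.15) -> list[str]:
    """Masked element recovery: hide random individual atom/bond tokens."""
    out = list(tokens)
    for i in random.sample(range(len(out)), max(1, int(len(out) * ratio))):
        out[i] = MASK
    return out


def mask_fragment(tokens: list[str], span: int = 4) -> list[str]:
    """Masked fragment recovery: hide one contiguous span of tokens."""
    start = random.randrange(0, max(1, len(tokens) - span))
    return tokens[:start] + [MASK] * span + tokens[start + span:]


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used here as an example precursor
print(mask_elements(tokens))
print(mask_fragment(tokens))
```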
**Selection of Correct Reaction Centers.** Empirical results of previous works (Dai et al. 2019; Coley et al. 2017) witness a considerable performance improvement after adding reaction type information. This is because the reaction type, categorized by patterns of reaction centers (Schneider et al. 2016), provides invaluable insight into reaction centers, and the performance of retrosynthesis prediction largely depends on correctly identifying the reaction center of a reaction. Thus, a reaction classification task suffices to teach the model to focus more on reaction centers: only if the model focuses on reaction centers can it correctly predict the reaction type. In Figure 1(f), the most likely reaction type is C-C bond formation, so the methoxy group is almost impossible to act as the reaction center.

**Data Augmentation.** Single-step retrosynthesis prediction usually uses canonical SMILES strings (Schwaller et al. 2019) to limit the randomness of generation. However, Tetko et al. (2020) proposed that random representations of molecules can serve as an augmentation method for retrosynthesis, and more training data also helps the model learn chemical rules. Further, we find that the canonicalized SMILES representation does not keep the same sub-graph consistent across different molecules. In other words, canonical SMILES benefits the generation of unique results but harms the learning of structural information from SMILES strings. During pre-training, we thus provide different SMILES strings of the same structure, which helps the model learn the equivalence between them. We follow the augmentation method proposed by Tetko et al. (2020), including random SMILES representations of products, precursors, and reversed precursors and products; a minimal augmentation sketch is given after the model overview below.

### An Overview of PMSR

[Figure 2: An overview of PMSR, showing the Transformer encoder-decoder with masked-SMILES inputs, masked LM and auto-regression objectives, and supervised contrastive learning across the batch.]

As shown in Figure 2, PMSR is a sequence-to-sequence chemical reaction model with a Transformer-based encoder and a Transformer-based decoder. We argue that the retrosynthesis prediction task is closer to conditional generation, so we adopt the pointer-generator (See, Liu, and Manning 2017; Nishida et al. 2019) in our model. The pointer-generator architecture allows the model to copy atoms directly from the product, improving the generation of rare atoms. We design three pre-training tasks, i.e., molecule recovery (MR), auto-regression (AR), and contrastive classification (CC). MR and AR are trained on both the encoder and the decoder; CC is learned within a batch of data and acts like a regularizer on the model. These three pre-training tasks are optimized simultaneously.
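As an illustration of the random-SMILES augmentation mentioned above, the following minimal sketch (not from the paper) uses RDKit's `doRandom` option to enumerate equivalent, non-canonical SMILES strings for both sides of a reaction. The aspirin example reaction and the number of variants are assumptions made only for demonstration.

```python
from rdkit import Chem

def random_smiles(smiles: str, n: int = 3) -> list[str]:
    """Return n random (non-canonical) SMILES strings of the same molecule.

    doRandom=True makes RDKit start the atom traversal at a random position,
    yielding different but chemically equivalent strings for one structure.
    """
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]

# Hypothetical reaction: aspirin from acetic anhydride and salicylic acid.
product = "CC(=O)Oc1ccccc1C(=O)O"
precursors = "CC(=O)OC(C)=O.O=C(O)c1ccccc1O"

# Pair random variants of the product with random variants of the precursors
# to create additional, chemically equivalent training examples.
for src, tgt in zip(random_smiles(product), random_smiles(precursors)):
    print(src, ">>", tgt)
```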
### Pre-training Tasks and Fine-tuning

**Auto-regression.** Our pre-training data are purely reactions, so we first design a supervised auto-regression task, which is the same as the downstream retrosynthesis task. In the auto-regression task, we have

$$\mathcal{L}_{AR} = \log P(y \mid x) = \sum_{t=1}^{n} \log P(y_t \mid y_{<t}, x),$$

where $y_{<t}$ denotes the tokens generated before step $t$.
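For readers who prefer code, the sketch below shows one way this auto-regressive objective is typically computed with teacher forcing; it is an assumption-laden illustration, not the authors' code. The `model` is assumed to be any encoder-decoder returning per-token vocabulary logits, and maximizing $\log P(y \mid x)$ is realized as minimizing the token-level cross-entropy.

```python
import torch.nn.functional as F

def autoregression_loss(model, src_ids, tgt_ids, pad_id=0):
    """Teacher-forced auto-regressive loss for a seq2seq model (a sketch).

    `model(src_ids, decoder_input)` is assumed to return logits of shape
    (batch, tgt_len - 1, vocab). Each target token y_t is predicted from the
    product x and the preceding tokens y_<t, matching the L_AR objective.
    """
    decoder_input = tgt_ids[:, :-1]          # y_<t (targets shifted right)
    labels = tgt_ids[:, 1:]                  # y_t  (tokens the model must predict)
    logits = model(src_ids, decoder_input)   # (batch, tgt_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,                 # do not penalize padding positions
    )
```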