Published as a conference paper at ICLR 2024

DOMAIN-AGNOSTIC MOLECULAR GENERATION WITH CHEMICAL FEEDBACK

Yin Fang, Ningyu Zhang, Zhuo Chen, Lingbing Guo, Xiaohui Fan, Huajun Chen
College of Computer Science and Technology, Zhejiang University
ZJU-Ant Group Joint Research Center for Knowledge Graphs, Zhejiang University
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University
{fangyin, zhangningyu, zhuo.chen, lbguo, fanxh, huajunsir}@zju.edu.cn

The generation of molecules with desired properties has become increasingly popular, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face challenges such as generating syntactically or chemically flawed molecules, having a narrow domain focus, and struggling to create diverse and feasible molecules due to limited annotated data or external molecular databases. To tackle these challenges, we introduce MOLGEN, a pre-trained molecular language model tailored specifically for molecule generation. Through the reconstruction of over 100 million molecular SELFIES, MOLGEN internalizes structural and grammatical insights. This is further enhanced by domain-agnostic molecular prefix tuning, fostering robust knowledge transfer across diverse domains. Importantly, our chemical feedback paradigm steers the model away from "molecular hallucinations", ensuring alignment between the model's estimated probabilities and real-world chemical preferences. Extensive experiments on well-known benchmarks underscore MOLGEN's optimization capabilities in properties such as penalized log P, QED, and molecular docking. Additional analyses confirm its proficiency in accurately capturing molecule distributions, discerning intricate structural patterns, and efficiently exploring the chemical space.1

1 INTRODUCTION

Molecule generation, the synthesis and design of novel molecules with desirable properties, holds an important place in chemical science, with numerous applications in drug discovery (Wang et al., 2022). Generating molecules is challenging due to the immense and discrete nature of the molecular space, which, with an estimated size of $10^{33}$, makes exhaustive searches impractical (Polishchuk et al., 2013). Recently, deep generative models (Jin et al., 2020; Zang & Wang, 2020; Luo et al., 2021; Shi et al., 2020b) have emerged as one of the most promising tools for exploring the broader synthetically accessible chemical space. These models' ability to automatically generate chemically valid and structurally similar molecules has proven invaluable for tasks such as the inverse design of functional compounds (Flam-Shepherd et al., 2022). Current deep generative models typically involve initial training of an unconditional generative model on a large set of existing molecules, and then use additional reward functions (Cao & Kipf, 2018; Popova et al., 2018; You et al., 2018; Popova et al., 2019; Shi et al., 2020b; Zang & Wang, 2020) or property predictors (Liu et al., 2018; Jin et al., 2019; Gómez-Bombarelli et al., 2018) to guide the synthesis of new molecules with desired properties.
However, these approaches are limited by challenges in training due to the high variance of Reinforcement Learning (RL) (Xie et al., 2021), fixed-dimensional latent generation space (Wang et al., 2023), and expert-provided generation rules (Sun et al., 2022), which impede efficient exploration of the broader chemical space.

Corresponding author.
1Code is available at https://github.com/zjunlp/MolGen.

Recent advancements in language models have demonstrated great potential for understanding complex molecular distributions (Flam-Shepherd et al., 2022). To gain a more profound comprehension of the underlying molecular structures and their representations, researchers have begun integrating SMILES (Weininger, 1988), a linear string notation for describing molecular structures, with pre-trained language models (PLMs) (Irwin et al., 2022). Despite their widespread use, several issues remain inadequately considered. Firstly, the brittleness of SMILES may lead to a high proportion of generated chemically invalid strings, either due to syntactic errors (e.g., not corresponding to molecular graphs) or fundamental chemical principle violations (e.g., exceeding the maximum number of inter-atomic valence bonds) (Krenn et al., 2020). Secondly, almost all previous studies have focused primarily on synthetic molecules, neglecting natural products (Du et al., 2022a). Notably, natural products, characterized by enormous scaffold diversity and structural complexity, exhibit a distinct distribution compared to synthetic molecules and confer additional challenges for numerous molecule generation applications such as drug discovery (Atanasov et al., 2021). Thirdly, pre-trained molecular language models often succumb to "molecular hallucinations". This refers to instances where the generated molecules structurally adhere to chemical rules, yet fail to demonstrate the anticipated chemical activity in practical applications. This occurs because, although the models assimilate a vast array of molecular structural representations during pre-training, they might not fully capture the complex relationships with real-world chemistry and biological properties. Some methods attempt to mitigate this issue by using supervised fine-tuning or external databases (Irwin et al., 2022; Wang et al., 2023), but they may constrain the direction of molecular optimization.

Figure 1: MOLGEN excels at generating chemically valid molecules with expected efficacy in both synthetic and natural product domains.

To tackle these challenges, we present MOLGEN, a novel pre-trained molecular language model designed for efficient molecule generation. As illustrated in Figure 1, our approach comprises: (i) A two-stage domain-agnostic molecular pre-training. First, we train bidirectional and auto-regressive Transformers (Vaswani et al., 2017) to reconstruct over 100 million corrupted molecular SELFIES (Krenn et al., 2020). This endows the model with a profound understanding of the structure, grammar, and intrinsic semantic information of SELFIES, an entirely robust molecular language, free from the predicaments of syntactic and semantic inconsistency often associated with conventional SMILES notation. Next, we leverage domain-agnostic molecular prefix tuning, enabling MOLGEN to harness knowledge transferable across diverse domains (i.e., synthetic and natural products), facilitating task adaptation. (ii) A chemical feedback paradigm to alleviate "molecular hallucinations".
By aligning the model's generative probabilities with real-world chemical preferences, MOLGEN learns to evaluate and rectify its molecular outputs, ensuring the generation of chemically valid molecules with genuine utility and anticipated properties. Through extensive testing on both synthetic and natural product molecular datasets, we establish MOLGEN's capability in producing chemically valid molecules, navigating chemical spaces efficiently, and achieving notable optimization in properties like penalized log P, QED, and molecular docking. Our further analysis underscores MOLGEN's adeptness at understanding complex molecular distributions, recognizing meaningful substructures, and the efficacy of the chemical feedback mechanism, offering novel perspectives and tools to the molecular generation community.

2 METHODOLOGY

Figure 2 illustrates the general framework of MOLGEN. The pre-training process (§2.1) comprises two stages: molecular language syntax learning and domain-agnostic molecular prefix tuning. Then, a chemical feedback paradigm (§2.2) is introduced to align the PLM with the anticipated chemical preferences in the downstream phase.

Figure 2: Overview of MOLGEN: pre-training (left) and downstream (right) stages.

2.1 DOMAIN-AGNOSTIC MOLECULAR PRE-TRAINING

Figure 3: Random double mutations of SMILES and SELFIES derived from the same molecule, with blue markings indicating mutation locations. The likelihood of retaining a valid SMILES after a single mutation is 9.9%; for SELFIES, it is a consistent 100% (Krenn et al., 2020).

SMILES and SELFIES are two molecular languages that associate a token sequence with a molecular structure. SMILES denotes molecules as chains of atoms, encapsulating branches within parentheses and signifying ring closures with corresponding number pairs. Despite its longstanding prominence in cheminformatics, SMILES is fundamentally flawed in that it lacks a mechanism to ensure the validity of molecular strings in terms of syntax and physical principles (Krenn et al., 2020). Hence, we employ SELFIES (Krenn et al., 2022), a fully robust molecular language that guarantees every possible combination of symbols in the alphabet corresponds to a chemically sound graph structure. In contrast to SMILES, SELFIES overcomes syntactic invalidity by mapping each token to a specific structure or reference, effectively resolving issues such as unbalanced parentheses or ring identifiers, as depicted in Figure 3. MOLGEN boasts a compact and specialized vocabulary size of 185. While modest in size, this vocabulary is already sufficient to ensure that the language model learns meaningful representations (Rives et al., 2021).

Being the first of its kind to train language models utilizing SELFIES, our work necessitates a solid foundation for comprehending both the syntax and semantics of this language. To achieve a high-quality initialization for MOLGEN, we employ the BART model (Lewis et al., 2020) during the first stage of pre-training, as shown in Figure 2. Firstly, we convert 100 million unlabeled molecules into SELFIES strings. The standardized representation of SELFIES facilitates the direct construction of an alphabet from the dataset, eliminating the need for a separate tokenizer to discern frequent substrings, thereby preventing the generation of nonsensical tokens. Secondly, we randomly select tokens from the original SELFIES string $S = \{s_1, \dots, s_j, \dots, s_l\}$ and replace them with a special token [MASK].
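The data-preparation step described above (SMILES-to-SELFIES conversion, alphabet construction directly from the corpus, and random token masking) can be illustrated with the open-source `selfies` package. The sketch below is ours, not the MOLGEN codebase: the mask token name and the 15% corruption rate are assumptions for illustration only.

```python
# A minimal sketch of the pre-training data preparation, assuming the open-source
# `selfies` package (pip install selfies). Mask token and masking rate are
# illustrative assumptions, not MOLGEN's exact settings.
import random
import selfies as sf

def smiles_to_selfies(smiles_list):
    """Step 1: convert SMILES strings into SELFIES strings."""
    return [sf.encoder(s) for s in smiles_list]

def build_alphabet(selfies_list):
    """Construct the token alphabet directly from the dataset (no subword tokenizer needed)."""
    alphabet = sf.get_alphabet_from_selfies(selfies_list)
    return sorted(alphabet) + ["[MASK]"]

def corrupt(selfies_string, mask_token="[MASK]", mask_rate=0.15, seed=None):
    """Step 2: randomly replace SELFIES tokens with the [MASK] token."""
    rng = random.Random(seed)
    tokens = list(sf.split_selfies(selfies_string))
    return "".join(mask_token if rng.random() < mask_rate else t for t in tokens)

if __name__ == "__main__":
    smiles = ["C1=CC=CC=C1", "CC(=O)OC1=CC=CC=C1C(=O)O"]  # benzene, aspirin
    selfies_strings = smiles_to_selfies(smiles)
    print(build_alphabet(selfies_strings))
    print(corrupt(selfies_strings[1], seed=0))
```

Because every SELFIES token maps to a valid structural instruction, any string built from this alphabet, including corrupted ones, still decodes to a chemically sound molecule via `sf.decoder`, which is the robustness property the pre-training relies on.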
Finally, we encode the corrupted SELFIES using a bidirectional model and calculate the likelihood of $S$ with a left-to-right autoregressive decoder. Formally, the cross-entropy between the decoder's output and the original input constitutes the reconstruction loss:

$$\mathcal{L}_{\mathrm{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\mathrm{true}}\left(s \mid S, S_{<j}\right) \log p_{\theta}\left(s \mid S, S_{<j}; \theta\right), \tag{1}$$

where $S_{<j}$ denotes the partial sequence preceding position $j$, and $p_{\mathrm{true}}$ refers to the one-hot distribution obtained under the standard maximum likelihood estimation:

$$p_{\mathrm{true}}\left(s \mid S, S_{<j}\right) = \begin{cases} 1, & s = s_j \\ 0, & s \neq s_j. \end{cases} \tag{2}$$

2.2 CHEMICAL FEEDBACK PARADIGM

To steer generation toward molecules with desired properties, candidate molecules are compared according to a property score $P_S(\cdot)$. For any two candidates $S_i, S_j$ in the candidate set $\mathcal{S}$ with $P_S(S_i) > P_S(S_j)$, we expect:

$$p_{\mathrm{true}}\left(S_i \mid S\right) > p_{\mathrm{true}}\left(S_j \mid S\right), \quad \forall S_i, S_j \in \mathcal{S},\ P_S(S_i) > P_S(S_j). \tag{5}$$

To incentivize the model to assign higher probabilities to candidates with desired properties, we utilize a rank loss (Liu et al., 2022). The rank loss arises when candidates with suboptimal properties obtain higher estimated probabilities compared to those with commendable properties:

$$\mathcal{L}_{\mathrm{rank}}(S) = \sum_{i} \sum_{j>i} \max\left(0,\ f(S_j) - f(S_i) + \gamma_{ij}\right), \quad \forall i < j,\ P_S(S_i) > P_S(S_j), \tag{6}$$

where $\gamma_{ij} = (j - i) \cdot \gamma$ represents the margin multiplied by the difference in rank between the candidates, and $f(S) = \sum_{t=1}^{l} \log p_{\theta}\left(s_t \mid S, S_{<t}; \theta\right)$ denotes the model's estimated log-probability of candidate $S$.
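As a concrete illustration of Eq. (6), the following PyTorch-style sketch computes $f(S)$ as the sum of per-token log-probabilities and applies the pairwise hinge penalty. It assumes candidates are already sorted by descending property score $P_S$ (index 0 is best); function names, the masking convention, and the margin value $\gamma$ are illustrative assumptions, not taken from the MOLGEN implementation.

```python
# A minimal sketch of the pairwise rank loss in Eq. (6), assuming candidates are
# pre-sorted by descending property score P_S. Names and the gamma value are
# illustrative, not MOLGEN's exact settings.
import torch

def sequence_logprob(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """f(S): sum of per-token log-probabilities log p_theta(s_t | S, S_<t) over real tokens.
    token_logprobs: (num_candidates, seq_len); mask: same shape, 1 for tokens, 0 for padding."""
    return (token_logprobs * mask).sum(dim=-1)

def rank_loss(f: torch.Tensor, gamma: float = 0.001) -> torch.Tensor:
    """L_rank: hinge penalty whenever a worse-property candidate S_j scores within
    gamma_ij = (j - i) * gamma of a better-property candidate S_i."""
    loss = f.new_zeros(())
    n = f.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            margin = (j - i) * gamma
            loss = loss + torch.clamp(f[j] - f[i] + margin, min=0.0)
    return loss

if __name__ == "__main__":
    # 3 candidates, 4 tokens each; candidate 0 has the best property score.
    torch.manual_seed(0)
    token_logprobs = torch.log(torch.rand(3, 4))  # stand-in for decoder log-probs
    mask = torch.ones(3, 4)
    f = sequence_logprob(token_logprobs, mask)
    print(rank_loss(f, gamma=0.001))
```

The loss is zero only when the model's sequence log-probabilities respect the property ranking with at least the rank-scaled margin, which is exactly the preference alignment the chemical feedback paradigm aims for.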