# 3D-MoLM: Towards 3D Molecule-Text Interpretation in Language Models

Published as a conference paper at ICLR 2024.

Sihang Li¹³, Zhiyuan Liu², Yanchen Luo¹³, Xiang Wang¹⁴, Xiangnan He¹³, Kenji Kawaguchi², Tat-Seng Chua², Qi Tian⁵

¹University of Science and Technology of China
²National University of Singapore
³MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC
⁴Institute of Dataspace, Hefei Comprehensive National Science Center
⁵Huawei Cloud

Equal contribution. Correspondence to Xiang Wang and Xiangnan He: {xiangwang1223, xiangnanhe}@gmail.com.

## ABSTRACT

Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector that bridges the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset, 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of the 3D molecular encoder and the LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and the more challenging open-text molecular QA tasks, especially those focusing on 3D-dependent properties. We release our codes and datasets at https://github.com/lsh0520/3D-MoLM.

## 1 INTRODUCTION

The advancement of Language Models (LMs) (Devlin et al., 2019; OpenAI, 2023b; Touvron et al., 2023a) has triggered a series of remarkable innovations across multiple disciplines (Zhao et al., 2023). Notably, LMs excel at text-based molecule understanding tasks, such as question answering (QA) in the chemical and medical domains (Taylor et al., 2022), by pretraining on extensive biochemical literature. Recognizing the potential of LMs in harnessing extensive biochemical knowledge for molecule-relevant tasks, molecule-text modeling has emerged as a new research direction (Edwards et al., 2021; 2022). Previous works have been dedicated to harmonizing texts with 1D molecular sequences (Zeng et al., 2022; Taylor et al., 2022) and 2D molecular graphs (Liu et al., 2023b; Su et al., 2022; Liu et al., 2022a), aiding in tasks like molecule-text retrieval and molecule captioning. However, they mostly leave 3D molecular structures untouched, which are crucial for understanding molecular dynamics, protein-ligand interactions, enzymatic functions, and a range of other biomolecular phenomena (Karplus & McCammon, 2002; Jorgensen, 2004).

To bridge this gap, we focus on 3D molecule-text interpretation, with the goal of enabling an LM to interpret and analyze 3D molecular structures through text generation. Given the recent successes of 3D molecular encoders in tasks like molecule property prediction, docking, and conformation prediction (Zhou et al., 2023; Lu et al., 2023; Fang et al., 2022), it is promising to incorporate one as an LM's perception module for 3D molecules.
Upon examination of the existing literature (Dai et al., 2023; Hong et al., 2023; Chung et al., 2022), we identify two key challenges in seamlessly integrating a 3D molecular encoder into an LM for 3D molecule-text interpretation: 1) **3D molecule-text alignment**, which maps 3D molecular representations into the input textual space where the LM can understand them; and 2) **3D molecule-centric instruction tuning**, which fine-tunes the model to follow human instructions on 3D molecule-relevant tasks.

Figure 1: Demonstration of 3D-MoLM. 3D-MoLM is a general-purpose molecular LM that can be applied to molecule-text retrieval, molecule captioning, and molecular QA tasks. Flame denotes tunable modules, while snowflake indicates frozen modules. [The figure depicts a 3D molecular encoder (Uni-Mol) feeding a 3D molecule-text projector (Q-Former), whose molecular tokens are consumed together with textual prompts by a LoRA-tuned Llama2 LM; example instructions and responses are shown for each task.]

To address these challenges, we propose 3D-MoLM: 3D-Molecular Language Modeling, as depicted in Figure 1. Specifically, it consists of two key components: 1) a 3D molecule-text projector for 3D molecule-text alignment, which aligns the latent representation spaces between the 3D molecular encoder and the LM, and 2) a dataset for 3D molecule-centric instruction tuning, 3D-MoIT, as shown in Figure 3. 3D-MoIT enhances the model's ability to follow human instructions and to discern 3D-dependent properties of molecules.

For 3D molecule-text alignment, we employ Q-Former (Li et al., 2023) as the 3D molecule-text projector, drawing inspiration from leading vision-language modeling methods (Zhu et al., 2023; Dai et al., 2023). Given a molecule's 3D structure, Q-Former converts it into tokens that serve as 1D soft prompts (Li & Liang, 2021), harmonizing seamlessly with the language space of the LM. This translation facilitates the LM's interpretation of 3D molecular structures. To cultivate the Q-Former's alignment capability, two training stages are conducted: the first stage focuses on 3D molecule-text representation learning, while the second stage optimizes for 3D molecule-text alignment. As depicted in Figure 3, these two training stages are facilitated by our collected 316K molecule-text pairs from PubChem (Kim et al., 2021). To promote the 3D molecule-text alignment process, we manipulate the dataset by generating 3D conformations from SMILES using RDKit (Landrum et al., 2013) and enriching the molecular descriptions with GPT-3.5 (OpenAI, 2023a).
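To ground the conformation-generation step just mentioned, here is a minimal sketch of producing a 3D conformation from SMILES with RDKit. The ETKDG settings, the force-field relaxation, and the helper name `smiles_to_3d_conformer` are our own illustrative choices, not necessarily the paper's exact pipeline:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_3d_conformer(smiles: str):
    """Generate one 3D conformation for a SMILES string (illustrative sketch)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    mol = Chem.AddHs(mol)                        # add explicit hydrogens before embedding
    params = AllChem.ETKDGv3()                   # distance geometry with torsion-angle knowledge
    params.randomSeed = 0                        # make the conformer reproducible
    if AllChem.EmbedMolecule(mol, params) == -1:
        raise RuntimeError("3D embedding failed")
    AllChem.MMFFOptimizeMolecule(mol)            # quick force-field relaxation of the geometry
    return mol

# Example: caffeic acid, the molecule shown in Figure 1.
mol3d = smiles_to_3d_conformer("C1=CC(=C(C=C1C=CC(=O)O)O)O")
coords = mol3d.GetConformer().GetPositions()     # (num_atoms, 3) coordinate array
```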
We will detail the collection and enrichment of the PubChem Dataset in Section 2.2.1 and Appendix B. Upon aligning 3D molecules with texts, we conduct instruction tuning using our curated dataset 3D-MoIT. It is designed to cultivate 3D-MoLM's ability to follow instructions and to enhance its perception of 3D-dependent molecule properties. Specifically, 3D-MoIT is sourced from two databases: 1) PubChem, which offers a wide range of molecular properties, origins, and applications, and 2) PubChemQC (Nakata, 2015), which specializes in 3D-dependent molecular properties. As shown in Figure 3, for the PubChem portion, we leverage GPT-3.5 to generate QA pairs based on the molecular descriptions. Yet, molecular properties collected from PubChem (e.g., molecular weight and LogP) can be largely inferred from 1D or 2D molecular data. To enhance 3D-MoIT's perception of 3D molecular structures, we further incorporate data from PubChemQC, which includes 3D-dependent molecule properties (e.g., HOMO and LUMO; McQuarrie & Simon, 1997). We fill these properties into a set of text templates, transforming them into instruction tuning formats, as Figure 1 illustrates; a concrete sketch of this template filling appears below, after the contribution list.

Our contributions can be summarized as follows:

- We propose 3D-MoLM, a new framework for 3D molecule-text interpretation. 3D-MoLM employs a 3D molecule-text projector to bridge the modality gap between a 3D molecular encoder and an LM, enabling the LM to perceive 3D molecular structures.
- We curate 3D-MoIT, a 3D molecule-centric instruction tuning dataset. We extract and transform data from PubChem and PubChemQC into an instruction-following format, to cultivate 3D-MoLM's ability in instruction following and 3D molecule-text interpretation.
- 3D-MoLM achieves state-of-the-art performance on extensive downstream tasks. Notably, on the PubChem Dataset, for molecule-text retrieval and molecule captioning, it outperforms baselines by 20% accuracy and 6.47 ROUGE-L, respectively. Further, it surpasses baselines with 1D or 2D molecular perceptions on open-text QA tasks, especially on 3D-dependent properties, verifying its capability of 3D molecule-text interpretation.

Figure 2: Illustration of 3D-MoLM's architectures at different stages. (a) Stage 1: the 3D molecule-text projector (i.e., Q-Former), with the attached frozen 3D molecular encoder, is optimized for 3D molecule-text representation learning. Stage 1 involves three training objectives: molecule-text matching, molecule-text contrasting, and molecule captioning. The panel shows (i) the Q-Former's architecture, with a molecule transformer and a text transformer sharing self-attention layers, and (ii) the self-attention masking strategy for the different tasks. (b) Stages 2 & 3: 3D-MoLM is trained to perform 3D molecule-to-text generation given 3D molecular tokens (extracted by the Q-Former) and 1D textual prompt tokens.
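To make the template-filling step concrete, below is a minimal sketch of converting a PubChemQC-style (molecule, property, value) record into the instruction format of Figure 1. The template strings, field names, and the helper `build_instruction_sample` are illustrative assumptions rather than the paper's exact templates:

```python
# Illustrative templates; the paper's actual template set is not reproduced here.
TASK_DESCRIPTION = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request."
)

TEMPLATES = {
    "lumo": ("Determine the LUMO value of this molecule.",
             "The LUMO value for the input molecule is {value:.3f} eV."),
    "homo": ("Please provide the HOMO value of this molecule.",
             "The HOMO value for the input molecule is {value:.3f} eV."),
}

def build_instruction_sample(smiles: str, prop: str, value: float) -> dict:
    """Fill one 3D-dependent property record into an instruction-tuning sample."""
    instruction, response = TEMPLATES[prop]
    return {
        "task_description": TASK_DESCRIPTION,
        "instruction": instruction,
        # "<3D molecular embeds>" marks where the Q-Former tokens are spliced in.
        "input": f"{smiles} <3D molecular embeds>",
        "response": response.format(value=value),
    }

# Example record in the spirit of Figure 3 (LUMO of a PubChemQC molecule).
sample = build_instruction_sample("C1=CC(C(C(=C1)C(=O)O)O)O", "lumo", -2.239)
```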
## 2 3D-MOLM: 3D MOLECULAR LANGUAGE MODELING

3D-MoLM incorporates a 3D molecular encoder into an LM, aiming to align 3D molecular geometries with textual concepts and to facilitate a comprehensive cross-modal understanding of molecules. Consequently, 3D-MoLM is able to read 3D molecular structures, amplifying its molecular understanding and facilitating 3D molecule-text interpretation. Our idea draws from related works in molecule-text modeling, multi-modal instruction tuning, and multi-modal LMs; see Appendix A for a comprehensive literature review. Here we delve into 3D-MoLM's architecture and its training pipeline.

### 2.1 MODEL ARCHITECTURE

3D-MoLM's architecture consists of three key components: 1) a 3D molecular encoder, focusing on encoding 3D molecular structures; 2) a 3D molecule-text projector, aiming to map the 3D molecular encoder's representations to the input space of the LM; and 3) an LM, which specializes in text generation and is later adapted for understanding 3D molecular structures.

**3D Molecular Encoder.** We adopt Uni-Mol (Zhou et al., 2023) as our 3D molecular encoder $f_{\text{mol}}$. Specifically, Uni-Mol is pretrained on a large molecule dataset comprising 209M 3D molecular conformations. Formally, let $m = (\mathcal{V}, \mathbf{h}, \mathbf{C})$ denote a molecule, where $\mathcal{V}$ and $\mathbf{h}$ separately represent the atomic nodes and their features, and $\mathbf{C} \in \mathbb{R}^{|\mathcal{V}| \times 3}$ collects the 3D coordinates of the nodes. In Uni-Mol, the representation of each pair of atoms is initialized using an invariant spatial positional encoding derived from the 3D coordinates $\mathbf{C}$. This encoding, grounded in the pairwise Euclidean distances between atoms, ensures that the representation remains consistent regardless of global rotations or translations. Subsequently, the representations of atoms and atom pairs engage in a self-attention mechanism, generating the molecular representation with 3D spatial information. Overall, the 3D molecular encoder $f_{\text{mol}}$ performs the molecule encoding procedure to obtain the atomic representations:

$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathcal{V}|}] = f_{\text{mol}}(m), \tag{1}$$

where $\mathbf{x}_i$ corresponds to the representation of the $i$-th atom.
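As an aside on the invariance claim, the following is a minimal PyTorch sketch of a pairwise-distance positional encoding in the spirit of Uni-Mol's: the features are built purely from inter-atomic Euclidean distances, so global rotations and translations leave them unchanged. The Gaussian radial basis, its size, and the distance cutoff are our own illustrative choices, not Uni-Mol's exact implementation:

```python
import math
import torch

def pairwise_distance_encoding(coords: torch.Tensor,
                               num_basis: int = 16,
                               max_dist: float = 10.0) -> torch.Tensor:
    """SE(3)-invariant pairwise encoding from 3D coordinates (illustrative sketch).

    coords: (N, 3) atom positions. Returns (N, N, num_basis) features built
    only from inter-atomic distances, hence unchanged by any global rotation
    or translation of the molecule.
    """
    dist = torch.cdist(coords, coords)                  # (N, N) Euclidean distances
    centers = torch.linspace(0.0, max_dist, num_basis)  # Gaussian basis centers
    width = max_dist / num_basis                        # shared basis width
    return torch.exp(-((dist.unsqueeze(-1) - centers) / width) ** 2)

# Invariance check: rotate about z and translate; the encoding is unchanged.
coords = torch.randn(5, 3)
t = 0.3
rot_z = torch.tensor([[math.cos(t), -math.sin(t), 0.0],
                      [math.sin(t),  math.cos(t), 0.0],
                      [0.0,          0.0,         1.0]])
moved = coords @ rot_z.T + torch.tensor([1.0, -2.0, 0.5])
assert torch.allclose(pairwise_distance_encoding(coords),
                      pairwise_distance_encoding(moved), atol=1e-5)
```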
Figure 3: Illustration of the model architectures (upper part) and the dataset usage (bottom part) for the three training stages. PubChem is used for stage 1 (i.e., 3D molecule-text representation learning) and stage 2 (i.e., 3D molecule-text alignment via generative learning). 3D-MoIT is used for stage 3, 3D molecule-centric instruction tuning. Texts in the same color indicate the same information source. [The bottom part shows example data: a GPT-3.5-enriched PubChem description of (2S,3S)-pterosin C, a naturally occurring indanone, paired with the instruction "What is the molecular structure of this molecule?", and PubChemQC HOMO/LUMO records filled into instruction templates, e.g., "Determine the LUMO value of this molecule." with the response "The LUMO value for the input molecule is -2.239 eV."]

**3D Molecule-Text Projector.** Taking inspiration from leading vision-language models (Li et al., 2023; Dai et al., 2023), we architect the 3D molecule-text projector $f_{\text{pro}}$ as a Querying Transformer (i.e., Q-Former) and initialize it from the SciBERT checkpoint (Beltagy et al., 2019). As illustrated in Figure 2a, the Q-Former has two transformers with shared self-attention layers: one molecule transformer for processing 3D molecule features, and one text transformer for processing texts. The text transformer follows the same architecture as BERT (Devlin et al., 2019), while the molecule transformer adds cross-attention modules between the self-attention and feed-forward modules to extract molecule features. Specifically, the molecule transformer maintains $K$ learnable query tokens. Given a 3D molecule input, the query tokens interact with the 3D molecular encoder's representations through the cross-attention modules. Therefore, the output representations of the $K$ query tokens contain molecule information, represented as $\mathbf{M} = [\mathbf{m}_1, \mathbf{m}_2, \ldots, \mathbf{m}_K]$. The 3D molecule-text projector's forward function can be written as:

$$\mathbf{M} = [\mathbf{m}_1, \mathbf{m}_2, \ldots, \mathbf{m}_K] = f_{\text{pro}}(\mathbf{X}). \tag{2}$$

**Language Model (LM).** We employ Llama2 (Touvron et al., 2023b) as our base LM $f_{\text{lm}}$ to leverage its powerful text generation capability and internal chemistry knowledge. Although pretrained for general-purpose usage, the extensive biomedical literature in Llama2's pretraining corpus enables it to efficiently interpret 1D molecular sequences (e.g., SMILES) and to proficiently address essential QA tasks that are relevant to molecular understanding. In this work, we let Llama2 process mixed token sequences that include both textual tokens and 3D molecular tokens, as detailed in Section 2.2.1. Formally, we denote a mixed token sequence that includes $l$ textual and molecular tokens as $Z = [z_1, z_2, \ldots, z_l]$. Further, the LM adopts a causal mask to generate the textual response $\hat{Z}$ of length $n$, where the prediction of the $i$-th token $\hat{z}_i$ depends on its preceding tokens:

$$\hat{Z} = [\hat{z}_{l+1}, \hat{z}_{l+2}, \ldots, \hat{z}_{l+n}], \qquad \hat{z}_i = f_{\text{lm}}(z_1, \ldots, z_l, \hat{z}_{l+1}, \ldots, \hat{z}_{i-1}). \tag{3}$$
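To make equation (2) concrete, here is a minimal PyTorch sketch of a Q-Former-style projector: $K$ learnable query tokens self-attend, cross-attend to the frozen encoder's atomic representations $\mathbf{X}$, and are mapped into the LM's input embedding space. The dimensions, the single-block depth, and the class name `QFormerProjector` are illustrative assumptions; the actual Q-Former stacks multiple blocks and carries the shared-self-attention text branch described above:

```python
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    """Minimal Q-Former-style 3D molecule-text projector (illustrative sketch).

    K learnable query tokens attend to the atomic representations X from the
    frozen 3D encoder (Eq. 1) and are projected into the LM's input space,
    yielding M = f_pro(X) as in Eq. 2. One block is shown for brevity.
    """
    def __init__(self, num_queries: int = 8, enc_dim: int = 512,
                 hidden_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(hidden_dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, 8, kdim=enc_dim,
                                                vdim=enc_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden_dim, hidden_dim * 4), nn.GELU(),
                                 nn.Linear(hidden_dim * 4, hidden_dim))
        self.to_lm = nn.Linear(hidden_dim, lm_dim)  # into the LM's embedding space

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        """X: (B, N_atoms, enc_dim) -> M: (B, K, lm_dim) soft-prompt tokens."""
        q = self.queries.unsqueeze(0).expand(X.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]   # queries exchange information
        q = q + self.cross_attn(q, X, X)[0]  # queries read the 3D molecule features
        q = q + self.ffn(q)
        return self.to_lm(q)

# Example: 5 atoms encoded by the frozen 3D encoder -> 8 molecular tokens.
M = QFormerProjector()(torch.randn(2, 5, 512))
print(M.shape)  # torch.Size([2, 8, 4096])
```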
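And here is a minimal sketch of the causal generation implied by equation (3): the molecular soft-prompt tokens are spliced in front of the textual prompt embeddings, and a causal LM autoregressively extends the sequence. The callable `lm`, the greedy decoding choice, and all names are our own illustrative assumptions; frameworks such as Hugging Face Transformers expose an analogous path through the model's `inputs_embeds` argument:

```python
import torch

@torch.no_grad()
def generate_from_mixed_tokens(lm, embed, textual_ids, mol_tokens,
                               max_new_tokens: int = 64):
    """Autoregressive decoding over a mixed molecule+text prompt (illustrative).

    lm:          a causal LM mapping (B, L, d) input embeddings to (B, L, V) logits.
    embed:       the LM's token-embedding layer.
    textual_ids: (B, T) prompt token ids; mol_tokens: (B, K, d) Q-Former outputs.
    Each step predicts the next token from all tokens before it (causal mask).
    """
    # Splice the K molecular soft-prompt tokens in front of the text embeddings.
    seq = torch.cat([mol_tokens, embed(textual_ids)], dim=1)  # (B, K+T, d)
    out_ids = []
    for _ in range(max_new_tokens):
        logits = lm(seq)                        # (B, L, V); causal mask inside lm
        next_id = logits[:, -1].argmax(dim=-1)  # greedy pick of the next token
        out_ids.append(next_id)
        seq = torch.cat([seq, embed(next_id).unsqueeze(1)], dim=1)
    return torch.stack(out_ids, dim=1)          # (B, max_new_tokens) response ids
```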