Hierarchical Graph Tokenization for Molecule-Language Alignment

Yongqiang Chen 1,2,3, Quanming Yao 4, Juzheng Zhang 5, James Cheng 3, Yatao Bian 6

Recently, there has been a surge of interest in extending the success of large language models (LLMs) from texts to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization leads to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom-, motif-, and molecule-level informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.

Most of the work was done when Yongqiang Chen was a PhD student at CUHK. 1 MBZUAI 2 Carnegie Mellon University 3 The Chinese University of Hong Kong 4 Tsinghua University 5 University of Maryland, College Park 6 Department of Computer Science, National University of Singapore. Correspondence to: Yatao Bian.

1. Introduction

Large language models (LLMs) have demonstrated impressive capabilities in understanding and processing natural
languages (Radford et al., 2019; OpenAI, 2022; Touvron et al., 2023a; Bubeck et al., 2023). Recently, there has been a surge of interest in extending the capabilities of LLMs to graph-structured data (Jin et al., 2023; Li et al., 2023d; Wei et al., 2024; Mao et al., 2024; Fan et al., 2024), particularly molecular graphs (Zhao et al., 2023; Cao et al., 2023). Inspired by the success of large vision-language models (Zhang et al., 2024b; Liu et al., 2023a), recent efforts in developing large graph-language models (LGLMs) typically adopt a graph neural network (GNN) (Xu et al., 2019) to tokenize molecules as a series of node embeddings (or node tokens), and then leverage an adapter, such as a multi-layer perceptron (MLP) or a Q-Former (Li et al., 2023a), to transform the node tokens into tokens compatible with LLMs (Fan et al., 2024). To bridge the gap between the graph and language modalities, LGLMs undergo molecule-language instruction tuning on molecular graphs paired with captions describing the molecules (Jin et al., 2023; Li et al., 2023d; Fan et al., 2024).

Despite recent progress, the tokenization in existing LGLMs neglects the essential hierarchical structures inherent in molecular graphs. In particular, high-order substructures in molecular graphs, such as motifs or functional groups, encode rich semantics of the biochemical functionalities of the molecules (Milo et al., 2002; Bohacek et al., 1996; Sterling & Irwin, 2015). For example, the presence of a hydroxide functional group (-OH) often indicates higher water solubility. Therefore, such substructural cues are essential for enabling LLMs to reason about molecules in a chemically meaningful way. However, existing LGLMs mostly tokenize molecules solely at the atom (node) level and feed LLMs with only node-level tokens.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
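The tokenize-then-adapt pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions: the GNN is mocked with random embeddings, a single linear layer stands in for the MLP or Q-Former adapter, and all dimensions are illustrative.

```python
import random

# Minimal sketch of node-token adaptation in an LGLM: a GNN yields one
# embedding per atom (here mocked with random vectors), and an adapter
# projects each node token into the LLM embedding space. The single
# linear layer is an illustrative stand-in for an MLP or Q-Former.

GNN_DIM, LLM_DIM = 8, 16  # illustrative sizes

def mock_gnn(num_atoms, seed=0):
    """Stand-in for a GNN encoder: one GNN_DIM-dim token per atom."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(GNN_DIM)] for _ in range(num_atoms)]

def adapter(node_tokens, weights):
    """Project node tokens (num_atoms x GNN_DIM) to LLM_DIM."""
    return [
        [sum(tok[i] * weights[i][j] for i in range(GNN_DIM)) for j in range(LLM_DIM)]
        for tok in node_tokens
    ]

W = [[0.1] * LLM_DIM for _ in range(GNN_DIM)]    # untrained illustrative weights
llm_tokens = adapter(mock_gnn(num_atoms=5), W)   # 5 node tokens, LLM_DIM each
```

The adapted tokens are then interleaved with the text tokens of the instruction before being fed to the LLM; in practice both the GNN and the adapter are trained during instruction tuning.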
Consequently, LLMs are required to implicitly infer the underlying substructures during the instruction tuning stage. The absence of the critical substructures not only imposes an unnecessary burden on the LLMs, but also leads to misaligned representations and a higher likelihood of hallucinations in downstream tasks. To quantify the issue, we introduce a diagnostic benchmark, called MotifHallu, which evaluates the ability of LGLMs to perceive the existence of common functional groups. Surprisingly, we find that existing LGLMs often produce false-positive predictions (i.e., they keep answering "Yes" for any functional group), highlighting a critical limitation in current graph tokenization strategies (Sec. 3.2).

[Figure 1 here: (a) Overview of the HIGHT framework. (b) Summary of performance.]

Figure 1. (a) Illustration of HIGHT: Given a molecule (i.e., PubChem ID 3, 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid), HIGHT detects the motifs and incorporates supernodes for each motif (the whole graph is also considered as a super motif). Then, HIGHT tokenizes the molecule into both node-level (i.e., atom) and motif-level (i.e., functional group) tokens. The hierarchical view enables LLMs to better align the molecular structures with the language descriptions of the molecule. (b) Performance overview: HIGHT significantly reduces the hallucination of LGLMs and improves the downstream performance across various molecule-centric tasks. Due to the heterogeneity of the evaluation metrics in each task, we perform some transformations on the numerical values. In MotifHallu, we report the macro F1 scores. For Property Classification and Molecular Caption, we report the averaged scores over all subtasks or submetrics. For Property Regression, we normalize the values to the range between 1 and 100, i.e., for a value a, the reported number is 0.5/a. For Chemical Reaction Prediction, we report the averaged values of BLEU, RDK, MACCS, and MORGAN.

This observation motivates the following research question:

Is there a feasible approach to integrate the intrinsic hierarchical molecular information into LLMs?

To tackle the problem, we propose a new molecule-language alignment strategy called HIerarchical GrapH Tokenization (HIGHT). As illustrated in Fig. 1, HIGHT adopts a hierarchical graph tokenizer and a hierarchical molecular instruction tuning dataset to facilitate a better alignment of the molecule and language modalities. Specifically, inspired by the success of hierarchical GNNs in molecular representation learning (Zhang et al., 2021; Zang et al., 2023; Inae et al., 2023; Luong & Singh, 2023), HIGHT transforms the original molecular graph into a hierarchical graph with motif-level and molecule-level nodes added in. Then, HIGHT employs a Vector Quantized-Variational AutoEncoder (VQVAE) to obtain atom-level, motif-level, and molecule-level tokens separately via self-supervised tasks (Zang et al., 2023). In addition, to further encourage the encoding and alignment of hierarchical information, HIGHT augments the original molecular instruction tuning dataset with motif-level descriptions.

Our contributions can be summarized as follows:

- To the best of our knowledge, we are the first to incorporate hierarchical graph information into LGLMs, considering both the architecture and the instruction tuning data.
- To facilitate the study of molecule-language alignment, we propose the first hallucination benchmark, MotifHallu, synthesized through question answering about common functional groups.
- We conduct extensive experiments on 14 real-world benchmarks. The results show that HIGHT significantly reduces hallucination on MotifHallu by up to 40% and consistently improves performance on downstream molecule-language tasks.

Hence, HIGHT, together with MotifHallu and HiPubChem, lays a solid foundation for developing graph foundation models via graph-language alignment.

2. Preliminaries

Large Graph-Language Models. As LLMs have demonstrated great capabilities across a wide range of natural language tasks, there has been increasing interest in extending LLMs to broader applications where the text data are associated with structural information (i.e., graphs) (Jin et al., 2023; Li et al., 2023d; Wei et al., 2024; Mao et al., 2024; Fan et al., 2024). A graph can be denoted as $G = (V, E)$ with a set of $n$ nodes $v \in V$ and a set of $m$ edges $(u, v) \in E$. Each node $u$ has node attributes $x_u \in \mathbb{R}^d$ and each edge $(u, v)$ has edge attributes $e_{u,v} \in \mathbb{R}^{d_e}$. A number of LGLMs have been developed to process graph-text associated data $D = \{G, c\}$, where $c = [c_1, ..., c_{l_c}]$ is the caption of the graph $G$. For node-centric tasks, $c_i$ is associated with the nodes (Tang et al., 2023), while in this paper we focus on graph-centric tasks, i.e., molecules and molecular captions (Liu et al., 2023c). Usually, an $l$-layer GNN is employed to encode a graph as:

$$h_u^{(l)} = \mathrm{COM}\big(h_u^{(l-1)}, \mathrm{AGG}(\{(h_u^{(l-1)}, h_v^{(l-1)}) \mid v \in N(u)\})\big), \qquad (1)$$

where $h_u^{(l)} \in \mathbb{R}^h$ refers to the embedding of node $u$ after $l$ layers of the GNN, $\mathrm{AGG}(\cdot)$ is the aggregation function (e.g., mean) over the information from the neighbors of node $u$, and $\mathrm{COM}(\cdot)$ is the operator for combining the information of node $u$ with that of its neighbors $N(u)$ (e.g., concatenation). Then, after $l$ message-passing iterations, the graph-level embedding can be obtained as:

$$h_G = \mathrm{READOUT}\big(\{h_u^{(l)} \mid u \in V\}\big), \qquad (2)$$

where $\mathrm{READOUT}(\cdot)$ is a pooling operator (e.g., mean pooling) over all the node embeddings.
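The message-passing scheme of Eqs. (1)-(2) can be sketched as follows. This is a minimal sketch under stated assumptions: AGG is instantiated as the mean over neighbor embeddings and COM as an elementwise sum (the equations above leave both operators generic, and a real GNN layer would also include learned weights).

```python
# Minimal sketch of Eqs. (1)-(2): one message-passing layer with mean
# aggregation (AGG) and elementwise-sum combine (COM), followed by a
# mean-pooling READOUT. Plain Python lists; no learned parameters.

def gnn_layer(h, neighbors):
    """One layer: h[u] <- COM(h[u], AGG over N(u)) as in Eq. (1)."""
    new_h = []
    for u in range(len(h)):
        nbrs = neighbors[u]
        if nbrs:  # AGG: mean of the neighbors' embeddings
            agg = [sum(h[v][d] for v in nbrs) / len(nbrs) for d in range(len(h[u]))]
        else:
            agg = [0.0] * len(h[u])
        # COM: elementwise sum of the self embedding and the aggregated message
        new_h.append([a + b for a, b in zip(h[u], agg)])
    return new_h

def readout(h):
    """Mean-pooling READOUT over all node embeddings, as in Eq. (2)."""
    n, d = len(h), len(h[0])
    return [sum(h[u][k] for u in range(n)) / n for k in range(d)]

# toy triangle graph with 2-dim node features
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = gnn_layer(h, neighbors)
hG = readout(h)  # graph-level embedding
```

Stacking `gnn_layer` $l$ times before `readout` yields the $l$-layer encoder described above.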
With the representations of the nodes and graphs, LGLMs can fuse the graph and language information in various ways, such as transforming the graph into natural language descriptions (Fatemi et al., 2024) or into neural prompts within the LLMs (Tian et al., 2024). In addition, the embeddings can also be leveraged to post-process the LLM outputs (Liu et al., 2024b). Orthogonal to different fusion mechanisms, in this work we focus on transforming graph embeddings into input tokens of LLMs, which can be formulated as (Tang et al., 2023; Chen et al., 2024a; Liu et al., 2023c; Zhao et al., 2023; Cao et al., 2023; Li et al., 2024):

$$p_\theta(a \mid q, h) = \prod_{i=1}^{l_a} p_\theta(a_i \mid q, f_n(h), a_{<i}), \qquad (3)$$

where $q$ is the question, $a = [a_1, ..., a_{l_a}]$ is the answer, $f_n(\cdot)$ is the adapter for the node tokens $h$, and $a_{<i}$ denotes the previously generated answer tokens.

To construct MotifHallu, we ask the LGLM about each common functional group: "Is there a ⟨functional group⟩ in the molecule?" Then, we examine whether the outputs of the LGLM mean "Yes" or "No". For each molecule, we construct questions with positive answers for all functional groups detected in the molecule, and questions with negative answers for 6 functional groups randomly sampled from the remaining ones. Hence, MotifHallu consists of 23,924 questions. While it is easy to scale up MotifHallu with more molecules and functional groups, we find that the current scale is already sufficient to demonstrate the issue (Table 2).

4. Hierarchical Graph Tokenization

To improve the molecule-language alignment, we propose a new strategy called HIerarchical GrapH Tokenization (HIGHT), which consists of a hierarchical graph tokenizer and a hierarchical molecular instruction tuning dataset that augment the inputs with hierarchical information.

4.1. Hierarchical Graph Tokenizer

Inspired by the success of hierarchical GNNs (Zhang et al., 2021; Zang et al., 2023), we transform the original molecular graph $G$ into a hierarchical graph $G'$ with motif-level and molecule-level nodes added in.
Specifically, we leverage the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm (Degen et al., 2008)² to detect and inject a set of $k+1$ supernodes, denoted as $M = \{M^{(1)}, ..., M^{(k)}, M^{(k+1)}\}$, with $k$ motifs and the original molecule $M^{(k+1)} = G$. Furthermore, denoting the sets of nodes and edges in $M^{(i)}$ as $V_m^{(i)}$ and $E_m^{(i)}$, respectively, we augment the original molecular graph $G$ into $G'$ with augmented nodes $V'$ and edges $E'$:

$$V' = V \cup \{v_m^{(1)}, ..., v_m^{(k+1)}\}, \qquad E' = E \cup \Big(\bigcup_{i=1}^{k+1} E_{ma}^{(i)}\Big), \qquad (6)$$

where $v_m^{(i)}$ is the motif supernode added to the original molecule, and $E_{ma}^{(i)} = \bigcup_{u \in V_m^{(i)}} \{(u, v_m^{(i)})\}$ are the augmented edges connecting the nodes within the corresponding motif to its motif supernode. We employ separate VQVAEs for atoms and motifs to learn meaningful code embeddings with several self-supervised learning tasks.

¹ https://github.com/rdkit/rdkit/blob/master/Data/FunctionalGroups.txt
² Note that HIGHT possesses a high degree of extensibility and can be augmented by incorporating advanced motif extraction techniques (such as that of Zhang et al. (2021)).

[Figure 2 here: (a) Node-centric tokenization. (b) HIGHT tokenization.]

Figure 2. Illustration of hallucination caused by node-centric tokenization. With only node-level tokens, LLMs have to relate the nodes within a specific functional group to align useful molecular structures with the corresponding language descriptions. Yet, due to the arbitrary order of atoms and position biases in LLMs, it is hard to recognize each functional group, leading to severe hallucinations.

The reconstructed attributes in Eq.
4 include atom types at the atom level and the number of atoms at the motif level (Zang et al., 2023). Merely feeding the motif tokens together with the node tokens to LLMs still cannot help distinguish the motifs from the atoms properly; hence we propose to further attach positional encodings $p$ to all of the tokens. We choose Laplacian positional embeddings (Dwivedi et al., 2020), while one could also adopt other variants (Ying et al., 2021). Since different types of tokens contain distinct semantics, we adopt separate adapters for the different types of tokens. Denoting the motif tokens as $h_m^{(i)}$ for motif $M^{(i)}$, generation with HIGHT is:

$$p_\theta(a \mid q, h, h_m) = \prod_{i=1}^{l_a} p_\theta\big(a_i \mid q, f_n(h_1), ..., f_n(h_n), f_m(h_m^{(1)}), ..., f_g(h_m^{(k+1)}), a_{<i}\big),$$
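The two graph-side ingredients of the tokenizer, the supernode augmentation of Eq. (6) and the Laplacian positional encodings, can be sketched together as follows. This is a minimal sketch under stated assumptions: motif detection is taken as already done (e.g., by BRICS), so the motif node sets are passed in directly, and the toy graph sizes and choice of $k$ are illustrative.

```python
import numpy as np

# Sketch of the hierarchical graph construction (Eq. 6) plus Laplacian
# positional encodings (Dwivedi et al., 2020). Motif node sets are assumed
# precomputed (e.g., via BRICS); we add one supernode per motif plus a
# molecule-level supernode, wire each to the nodes it covers, then take
# eigenvectors of the symmetric-normalized Laplacian as positional features.

def augment_graph(num_nodes, edges, motifs):
    """Eq. (6): V' = V u {v_m^(i)}, E' = E u (union_i E_ma^(i))."""
    all_motifs = list(motifs) + [set(range(num_nodes))]  # super motif = molecule
    new_edges = list(edges)
    for i, node_set in enumerate(all_motifs):
        v_m = num_nodes + i                              # supernode index v_m^(i)
        new_edges.extend((u, v_m) for u in sorted(node_set))  # E_ma^(i)
    return num_nodes + len(all_motifs), new_edges

def laplacian_pe(num_nodes, edges, k=2):
    """k smallest non-trivial eigenvectors of the normalized Laplacian."""
    A = np.zeros((num_nodes, num_nodes))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(num_nodes) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]            # drop the trivial eigenvector

# toy molecule: 4 atoms in a chain, one 2-atom motif {0, 1}
n_aug, e_aug = augment_graph(4, [(0, 1), (1, 2), (2, 3)], [{0, 1}])
pe = laplacian_pe(n_aug, e_aug, k=2)      # one 2-dim encoding per token
```

Each row of `pe` would be attached to the corresponding atom- or motif-level token before the adapters; the sign ambiguity of the eigenvectors is commonly handled by random sign flipping during training.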