Graph Generative Pre-trained Transformer

Xiaohui Chen 1  Yinkai Wang 1  Jiaxing He 2  Yuanqi Du 3  Soha Hassoun 1  Xiaolin Xu 2  Li-Ping Liu 1

1Tufts University  2Northeastern University  3Cornell University. Correspondence to: Xiaohui Chen, Li-Ping Liu.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Graph generative models, which can produce complex structures that resemble real-world data, serve as essential tools in domains such as molecular design and network analysis. While many existing generative models rely on adjacency matrices, this work introduces a token-based approach that represents graphs as token sequences and generates them via next-token prediction, offering a more efficient encoding. Based on this methodology, we propose the Graph Generative Pre-trained Transformer (G2PT), an autoregressive Transformer architecture that learns graph structures through this sequence-based paradigm. To extend G2PT's capabilities as a general-purpose foundation model, we further develop fine-tuning strategies for two downstream tasks: goal-oriented generation and graph property prediction. Comprehensive experiments on multiple datasets demonstrate G2PT's superior performance in both generic graph generation and molecular generation. The experimental results further show that G2PT can be effectively applied to goal-oriented molecular design and graph representation learning. The code of G2PT is released at https://github.com/tuftsml/G2PT.

1. Introduction

Graph generation has emerged as a crucial task across diverse fields such as chemical discovery and social network analysis, thanks to its ability to model complex relationships and produce realistic, structured data (Du et al., 2021; Zhu et al., 2022). Early generation methods such as DeepGMG (Li et al., 2018) and GraphRNN (You et al., 2018b) model graphs with
sequential models. These approaches employ sequential frameworks (e.g., RNNs or LSTMs (Sherstinsky, 2020)) to generate graphs step by step. For instance, GraphRNN generates adjacency matrix entries one at a time; for undirected graphs, it only needs to generate the lower-triangular part of the adjacency matrix. DeepGMG frames graph generation as a sequence of actions (e.g., add-node, add-edge) and uses an agent-based model to learn the action trajectories.

Recent advances in graph generative models have primarily focused on permutation-invariant methods, particularly diffusion-based approaches (Ho et al., 2020; Austin et al., 2021). For example, models like EDP-GNN (Niu et al., 2020) and GDSS (Jo et al., 2022a) treat adjacency matrices as continuous values. DiGress (Vignac et al., 2022) and EDGE (Chen et al., 2023) employ discrete diffusion, treating node types and all node pairs (edges and non-edges) as categorical variables. These models start from a random or fixed adjacency matrix and run denoising steps to sample an adjacency matrix from the target graph distribution. They specify exchangeable (permutation-invariant) distributions over graphs by assigning the same probability to adjacency matrices of the same graph. However, achieving the permutation-invariant property has a price: the underlying neural network needs to be permutation-invariant as well, limiting the architecture choice to graph neural networks only. Discrete diffusion has an additional limitation: it samples matrix entries independently at each denoising step, making it challenging to learn the true distribution when the number of denoising steps is insufficient (Lezama et al., 2022; Campbell et al., 2022). In recent years, the revolutionary success of large language models (Achiam et al., 2023; Dubey et al., 2024) has shown the power of autoregressive Transformers and inspired the application of these models in other fields such as image generation (Esser et al., 2021).
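To make the contrast in sequence length concrete, the sketch below (ours; the edge-token list is a simplified stand-in, not the paper's exact tokenization) serializes an undirected graph two ways: GraphRNN-style, emitting one binary value per node pair in the lower triangle, and as a sparse list with one token per edge. For sparse graphs the second sequence is much shorter.

```python
def dense_sequence(adj):
    # GraphRNN-style: emit every lower-triangular entry A[i][j], j < i,
    # i.e., one binary value per node pair.
    n = len(adj)
    return [adj[i][j] for i in range(n) for j in range(i)]

def edge_token_sequence(adj):
    # Sparse alternative (simplified): one (i, j) token per existing edge only.
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(i) if adj[i][j]]

# 5-cycle: 5 edges among 10 possible node pairs.
adj = [[0] * 5 for _ in range(5)]
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    adj[u][v] = adj[v][u] = 1

print(len(dense_sequence(adj)))       # 10: one entry per node pair
print(len(edge_token_sequence(adj)))  # 5: one token per edge
```

The gap widens quadratically with graph size: the dense sequence always has n(n-1)/2 entries, while the sparse one grows only with the number of edges.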
In this work, we revisit the sequential approach to graph generation and introduce a novel token-based encoding scheme for representing sparse graphs as sequences. This new encoding strategy unlocks the potential of Transformer architectures for graph generation. We train autoregressive Transformers to generate graphs by predicting token sequences, resulting in our proposed method: Graph Generative Pre-trained Transformer (G2PT). While G2PT does not maintain permutation invariance, we argue that its capacity to learn accurate graph distributions outweighs this limitation.

[Table 1: comparison of model families by representation, likelihood factorization, number of network calls, and number of modeled variables; e.g., diffusion over A factorizes as $p(A_T)\prod_{t=1}^{T} p(A_{t-1} \mid A_t)$ with $T$ network calls, $O(Tn^2)$ variables, and an intractable likelihood, whereas sequential models over A factorize $p(A)$ entry by entry with a tractable likelihood.]

Self-bootstrapping (SBS). Given a sequence of decreasing tolerances $\omega_1 > \omega_2 > \cdots > \omega_k = \omega$, we obtain a sequence of fine-tuned models by iteratively constructing fine-tuned datasets using the model trained from the previous tolerance. The SBS algorithm combined with RFT is shown in Alg. 3.

Reinforcement learning. Given a target-relevant reward function $r_z(G)$, we consider a KL-regularized reinforcement learning problem:

$$\phi^{*} = \arg\max_{\phi}\ \mathbb{E}_{p_\phi(s)}\big[r_z(s)\big] - \rho_1\,\mathrm{KL}\big(p_\phi(s)\,\|\,p_\theta(s)\big).$$

Here $r_z(s) = r_z(G)$, as $s$ uniquely determines $G$. The KL divergence term prevents the target model from deviating too much from the pre-trained model. We choose Proximal Policy Optimization (PPO) (Schulman et al., 2017) to effectively train the target model $p_\phi$ without sacrificing stability. The token-level reward assigns $r_z(s)$ only at the last token and zero otherwise:

$$R(s_{<t}, s_t) = \begin{cases} r_z(s) & \text{if } t = |s|, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure 2: histograms of QED, SA, and GSK3β scores for the data, the pre-trained model, and fine-tuned models; panel (a) rejection sampling fine-tuning (with self-bootstrap, e.g., RFT(>0.2) through RFTSBS3(>0.8) for GSK3β); panel (b) the reinforcement learning framework (PPO).] Figure 2.
Goal-oriented molecule generation using QED, SA, and GSK3β scores. Top row (a) shows the results using RFT, and bottom row (b) shows the results using RL. The quantitative results are presented in Table 4.

On MOSES, G2PT surpasses other state-of-the-art models in validity, uniqueness, FCD, and SNN metrics. We detail the metrics in Appendix B.6. Notably, the FCD, SNN, and scaffold similarity (Scaf) evaluations compare generated samples to a held-out test set, where the test molecules have scaffolds distinct from the training data. Although the scaffold similarity score is relatively low, the overall performance indicates that G2PT achieves a better goodness of fit on the training set. G2PT also delivers strong performance on the GuacaMol and QM9 datasets. We additionally provide qualitative examples from the MOSES and GuacaMol datasets in the table.

5.5. Goal-oriented Generation

In addition to distribution learning, which aims to draw independent samples from the learned graph distribution, goal-oriented generation is a major task in graph generation: it aims to draw samples under additional constraints or preferences and is key to many applications such as molecule optimization (Du et al., 2024). We validate the capability of G2PT on goal-oriented generation by fine-tuning the pre-trained model.
| Method | BBBP | Tox21 | ToxCast | SIDER | ClinTox | MUV | HIV | BACE | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AttrMask (Hu et al., 2020a) | 70.2±0.5 | 74.2±0.8 | 62.5±0.4 | 60.4±0.6 | 68.6±9.6 | 73.9±1.3 | 74.3±1.3 | 77.2±1.4 | 70.2 |
| InfoGraph (Sun et al., 2020) | 69.2±0.8 | 73.0±0.7 | 62.0±0.3 | 59.2±0.2 | 75.1±5.0 | 74.0±1.5 | 74.5±1.8 | 73.9±2.5 | 70.1 |
| ContextPred (Hu et al., 2020a) | 71.2±0.9 | 73.3±0.5 | 62.8±0.3 | 59.3±1.4 | 73.7±4.0 | 72.5±2.2 | 75.8±1.1 | 78.6±1.4 | 70.9 |
| GraphCL (You et al., 2021) | 67.5±2.5 | 75.0±0.5 | 62.8±0.2 | 60.1±1.3 | 78.9±4.2 | 77.1±1.0 | 75.0±0.4 | 68.7±7.8 | 70.6 |
| GraphMVP (Liu et al., 2022a) | 68.5±0.2 | 74.5±0.0 | 62.7±0.1 | 62.3±1.6 | 79.0±2.5 | 75.0±1.4 | 74.8±1.4 | 76.8±1.1 | 71.7 |
| GraphMAE (Hou et al., 2022b) | 70.9±0.9 | 75.0±0.4 | 64.1±0.1 | 59.9±0.5 | 81.5±2.8 | 76.9±2.6 | 76.7±0.9 | 81.4±1.4 | 73.3 |
| G2PTsmall (no pre-training) | 60.7±0.3 | 66.4±0.5 | 57.0±0.3 | 61.6±0.2 | 67.8±1.1 | 45.8±8.5 | 70.1±7.5 | 68.8±1.3 | 62.3 |
| G2PTbase (no pre-training) | 56.5±0.2 | 67.4±0.4 | 57.9±0.1 | 60.2±2.8 | 71.0±5.6 | 60.1±1.3 | 72.7±1.1 | 73.4±0.3 | 64.9 |
| G2PTsmall | 68.5±0.5 | 74.7±0.2 | 61.2±0.1 | 61.7±1.0 | 82.3±2.2 | 74.9±0.1 | 75.7±0.4 | 81.3±0.5 | 72.5 |
| G2PTbase | 71.0±0.4 | 75.0±0.3 | 63.0±0.5 | 61.9±0.2 | 82.1±1.1 | 74.5±0.3 | 76.3±0.4 | 82.3±1.6 | 73.3 |

Table 5. Results for molecule property prediction in terms of ROC-AUC. We report mean and standard deviation over three runs.

Practically, we employ the model pre-trained on the GuacaMol dataset and select three commonly used physiochemical and binding-related properties: quantitative estimate of drug-likeness (QED), synthetic accessibility (SA), and activity against the target protein glycogen synthase kinase 3 beta (GSK3β), detailed in Appendix B.3. The property oracle functions are provided by the Therapeutics Data Commons (TDC) package (Huang et al., 2022). As discussed in Section 4.1, we employ two approaches for fine-tuning: (1) rejection sampling fine-tuning and (2) reinforcement learning with PPO. Figure 2 shows that both methods effectively push the learned distribution toward the distribution of interest. Notably, RFT, with up to three rounds of SBS, significantly shifts the distribution towards a desired one.
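A minimal sketch of the rejection-sampling fine-tuning loop with self-bootstrapping, under heavy simplifying assumptions: the generator and its fine-tuning step are replaced by a toy score sampler with a shiftable bias, and the increasing thresholds mirror the GSK3β schedule ({>0.2, >0.4, >0.6, >0.8}). The function names and update rule are ours, not the paper's implementation.

```python
import random

random.seed(0)

def sample_scores(bias, n=1000):
    # Stand-in for "sample graphs from the current model and score them
    # with the property oracle"; `bias` crudely models the effect of
    # fine-tuning, shifting the toy score distribution upward.
    return [min(1.0, random.random() + bias) for _ in range(n)]

def rft_with_sbs(thresholds, n=1000):
    # Self-bootstrapping RFT: at each round, accept only samples whose
    # reward exceeds the current threshold, then "fine-tune" on the
    # accepted set (here: shift the sampler toward the accepted mean).
    bias = 0.0
    for omega in thresholds:  # progressively stricter thresholds
        accepted = [s for s in sample_scores(bias, n) if s > omega]
        if accepted:
            bias = sum(accepted) / len(accepted) - 0.5
    return bias

final_bias = rft_with_sbs([0.2, 0.4, 0.6, 0.8])
print(final_bias > 0)  # the toy distribution has shifted toward high rewards
```

The point of the bootstrapping schedule is visible even in this toy: sampling directly at the strictest threshold would reject almost everything under the initial distribution, while each intermediate round moves the sampler close enough that the next threshold yields a usable accepted set.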
In contrast, PPO, despite biasing the distribution, suffers from over-regularization toward the base policy, which is imposed for training stability. In the most challenging case (GSK3β), PPO fails to sample data with high rewards. Conversely, RFT overcomes this barrier in the second round (RFTSBS1), where its distribution becomes flat across the range and then quickly transitions to a high-reward distribution.

5.6. Predictive Performance on Downstream Tasks

We conduct experiments on eight graph classification benchmark datasets from MoleculeNet (Wu et al., 2018a), strictly following the data splitting protocol used in GraphMAE (Hou et al., 2022a) for fair comparison. A detailed description of these datasets is provided in Appendix B.4. For downstream fine-tuning, we initialize G2PT with parameters pre-trained on the GuacaMol dataset, which contains molecules with up to 89 heavy atoms. We also provide results where models are not pre-trained.

As summarized in Table 5, G2PT's graph embeddings demonstrate consistently strong (best or second-best) performance on seven out of eight downstream tasks, achieving overall performance comparable to GraphMAE, a leading self-supervised learning (SSL) method. Notably, while previous SSL approaches leverage additional features such as 3D information or chirality, G2PT is trained exclusively on 2D graph structural information. Overall, these results indicate that G2PT not only excels in generation but also learns effective graph representations.

[Figure 3: validity (%) on MOSES, GuacaMol, and QM9 as a function of model size and of the number of sequences per graph.] Figure 3. Model and data scaling effects.

5.7. Scaling Effects

We analyze how scaling the model size and the data size affects model performance on the three molecular datasets, using the validity score to quantify performance. Results are provided in Figure 3. For model scaling, we additionally train G2PTs with 1M, 707M, and 1.5B parameters.
We notice that as model size increases, the validity score generally increases and saturates at some point, depending on task complexity. For instance, QM9 saturates at the beginning (1M parameters), while MOSES and GuacaMol require more than 85M (base) parameters to achieve satisfying performance. For data scaling, we generate multiple sequences from the same graph to improve the diversity of the training data. The number of augmented sequences per graph is chosen from {1, 10, 100}. As shown, one sequence per graph is insufficient to train Transformers effectively, and improving data diversity helps improve model performance. Similar to model scaling, performance saturates once enough data are used.

6. Conclusion

This work revisits the sequential approach to graph generation and proposes a novel token-based representation that efficiently encodes graph structures via node and edge tokens. This representation serves as the foundation for the proposed Graph Generative Pre-trained Transformer (G2PT), an autoregressive model that effectively models graph sequences using next-token prediction. Extensive evaluations demonstrate that G2PT achieves remarkable performance across multiple datasets and tasks, including generic graph and molecule generation, as well as downstream tasks like goal-oriented graph generation and graph property prediction. The results highlight G2PT's adaptability and scalability, making it a versatile framework for various applications. One limitation of our method is that G2PT is order-sensitive: different graph domains may prefer different edge orderings. Future work could explore edge orderings that are more universal and expressive.

Impact Statement

This paper introduces a framework that models graphs in a similar vein to GPT (Generative Pre-trained Transformer).
The G2PT framework allows seamless transfer of training techniques that have been developed around GPT in other domains. Besides performing generative tasks such as drug discovery, G2PT can also be easily extended to discriminative tasks such as graph property prediction. We hope this work will advance the field of graph learning. As a powerful tool, G2PT may also be used as one step in a complex system to create molecular structures harmful to humans or the environment, but we don't see immediate hazards from our study.

Acknowledgment

We thank all reviewers for their insightful feedback. Chen and Liu's work was supported by NSF 2239869. He and Xu's work was supported by NSF 2239672.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981-17993, 2021.

Bacciu, D., Micheli, A., and Podda, M. Edge-based sequential graph generation with recurrent neural networks. Neurocomputing, 416:177-189, 2020.

Bagal, V. and Aggarwal, R. LigGPT: Molecular generation using a transformer-decoder model.

Bergmeister, A., Martinkus, K., Perraudin, N., and Wattenhofer, R. Efficient and scalable graph generation through iterative local expansion. arXiv preprint arXiv:2312.11529, 2023.

Bergmeister, A., Martinkus, K., Perraudin, N., and Wattenhofer, R. Efficient and scalable graph generation through iterative local expansion, 2024. URL https://arxiv.org/abs/2312.11529.

Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. GuacaMol: Benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 59(3):1096-1108, March 2019. ISSN 1549-960X. doi: 10.1021/acs.jcim.8b00839. URL http://dx.doi.org/10.1021/acs.jcim.8b00839.
Campbell, A., Benton, J., De Bortoli, V., Rainforth, T., Deligiannidis, G., and Doucet, A. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 35:28266-28279, 2022.

Campbell, A., Yim, J., Barzilay, R., Rainforth, T., and Jaakkola, T. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. arXiv preprint arXiv:2402.04997, 2024.

Chen, D., O'Bray, L., and Borgwardt, K. Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pp. 3469-3489. PMLR, 2022a.

Chen, X., Han, X., Hu, J., Ruiz, F. J., and Liu, L. Order matters: Probabilistic modeling of node sequence for graph generation. arXiv preprint arXiv:2106.06189, 2021.

Chen, X., Li, Y., Zhang, A., and Liu, L.-P. NVDiff: Graph generation through the diffusion of node vectors. arXiv preprint arXiv:2211.10794, 2022b.

Chen, X., He, J., Han, X., and Liu, L.-P. Efficient and degree-guided graph generation via discrete diffusion modeling. arXiv preprint arXiv:2305.04111, 2023.

Chen, X., Wang, Y., Du, Y., Hassoun, S., and Liu, L. On separate normalization in self-supervised transformers. Advances in Neural Information Processing Systems, 36, 2024.

Dai, H., Nazi, A., Li, Y., Dai, B., and Schuurmans, D. Scalable deep generative modeling for sparse graphs. In International Conference on Machine Learning, pp. 2302-2312. PMLR, 2020.

De Cao, N. and Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Diamant, N. L., Tseng, A. M., Chuang, K. V., Biancalani, T., and Scalia, G. Improving graph generation by restricting graph bandwidth. In International Conference on Machine Learning, pp. 7939-7959. PMLR, 2023.

Dosovitskiy, A.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Du, Y., Wang, S., Guo, X., Cao, H., Hu, S., Jiang, J., Varala, A., Angirekula, A., and Zhao, L. GraphGT: Machine learning datasets for graph generation and transformation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

Du, Y., Jamasb, A. R., Guo, J., Fu, T., Harris, C., Wang, Y., Duan, C., Liò, P., Schwaller, P., and Blundell, T. L. Machine learning-aided generative molecular design. Nature Machine Intelligence, pp. 1-16, 2024.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.

Eijkelboom, F., Bartosh, G., Naesseth, C. A., Welling, M., and van de Meent, J.-W. Variational flow matching for graph generation. arXiv preprint arXiv:2406.04843, 2024.

Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873-12883, 2021.

Gao, Z., Dong, D., Tan, C., Xia, J., Hu, B., and Li, S. Z. A graph is worth K words: Euclideanizing graph using pure transformer. arXiv preprint arXiv:2402.02464, 2024.

Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T., Synnaeve, G., Adi, Y., and Lipman, Y. Discrete flow matching. arXiv preprint arXiv:2407.15595, 2024.

Haefeli, K. K., Martinkus, K., Perraudin, N., and Wattenhofer, R. Diffusion models for graphs benefit from discrete state spaces. arXiv preprint arXiv:2210.01549, 2022.

Han, X., Chen, X., Ruiz, F. J. R., and Liu, L.-P. Fitting autoregressive graph generative models through maximum likelihood estimation.
Journal of Machine Learning Research, 24(97):1-30, 2023. URL http://jmlr.org/papers/v24/22-0337.html.

Hihi, S. and Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. Advances in Neural Information Processing Systems, 8, 1995.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., and Tang, J. GraphMAE: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 594-604, 2022a.

Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., and Tang, J. GraphMAE: Self-supervised masked graph autoencoders, 2022b. URL https://arxiv.org/abs/2205.10803.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks, 2020a. URL https://arxiv.org/abs/1905.12265.

Hu, Z., Dong, Y., Wang, K., and Sun, Y. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pp. 2704-2710, 2020b.

Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., Coley, C. W., Xiao, C., Sun, J., and Zitnik, M. Artificial intelligence foundation for therapeutic science. Nature Chemical Biology, 18(10):1033-1036, 2022.

Jang, Y., Lee, S., and Ahn, S. A simple and scalable representation for graph generation. arXiv preprint arXiv:2312.02230, 2023.

Jo, J., Lee, S., and Hwang, S. J. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pp. 10362-10383. PMLR, 2022a.

Jo, J., Lee, S., and Hwang, S. J. Score-based generative modeling of graphs via the system of stochastic differential equations, 2022b. URL https://arxiv.org/abs/2202.02514.

Jo, J., Kim, D., and Hwang, S. J. Graph generation with diffusion mixture.
arXiv preprint arXiv:2302.03596, 2023.

Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582-14595, 2022.

Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618-21629, 2021.

Lezama, J., Salimans, T., Jiang, L., Chang, H., Ho, J., and Essa, I. Discrete predictor-corrector diffusion models for image synthesis. In The Eleventh International Conference on Learning Representations, 2022.

Li, P. and Leskovec, J. The expressive power of graph neural networks. Graph Neural Networks: Foundations, Frontiers, and Applications, pp. 63-98, 2022.

Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.

Liao, R., Li, Y., Song, Y., Wang, S., Hamilton, W., Duvenaud, D. K., Urtasun, R., and Zemel, R. Efficient graph generation with graph recurrent attention networks. Advances in Neural Information Processing Systems, 32, 2019.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

Liu, J., Kumar, A., Ba, J., Kiros, J., and Swersky, K. Graph normalizing flows. Advances in Neural Information Processing Systems, 32, 2019.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. Advances in Neural Information Processing Systems, 31, 2018.

Liu, S., Wang, H., Liu, W., Lasenby, J., Guo, H., and Tang, J. Pre-training molecular graph representation with 3D geometry, 2022a. URL https://arxiv.org/abs/2110.07728.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022b.
Liu, Z., Lu, M., Zhang, S., Liu, B., Guo, H., Yang, Y., Blanchet, J., and Wang, Z. Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer. arXiv preprint arXiv:2405.16436, 2024.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.

Luo, Y., Yan, K., and Ji, S. GraphDF: A discrete flow model for molecular graph generation. In International Conference on Machine Learning, pp. 7192-7203. PMLR, 2021.

Madhawa, K., Ishiguro, K., Nakago, K., and Abe, M. GraphNVP: An invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600, 2019.

Martinkus, K., Loukas, A., Perraudin, N., and Wattenhofer, R. SPECTRE: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators, 2022. URL https://arxiv.org/abs/2204.01613.

Min, E., Chen, R., Bian, Y., Xu, T., Zhao, K., Huang, W., Zhao, P., Huang, J., Ananiadou, S., and Rong, Y. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455, 2022.

Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pp. 4474-4484. PMLR, 2020.

Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9:1-14, 2017.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. Molecular Sets (MOSES): A benchmarking platform for molecular generation models, 2020. URL https://arxiv.org/abs/1811.12823.

Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., and Klambauer, G.
Fréchet ChemNet distance: A metric for generative models for molecules in drug discovery, 2018. URL https://arxiv.org/abs/1803.09518.

Qin, Y., Madeira, M., Thanou, D., and Frossard, P. DeFoG: Discrete flow matching for graph generation, 2024. URL https://arxiv.org/abs/2410.04263.

Radford, A. Improving language understanding by generative pre-training. 2018.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501-14515, 2022.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Physica D: Nonlinear Phenomena, 404:132306, 2020.

Simonovsky, M. and Komodakis, N. GraphVAE: Towards generation of small graphs using variational autoencoders. In Artificial Neural Networks and Machine Learning - ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I 27, pp. 412-422. Springer, 2018.

Siraudin, A., Malliaros, F. D., and Morris, C. Cometh: A continuous-time discrete-state graph diffusion model. arXiv preprint arXiv:2406.06449, 2024.

Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization, 2020. URL https://arxiv.org/abs/1908.01000.

Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P.
DiGress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.

Wang, Y., Chen, X., Liu, L., and Hassoun, S. MADGEN: Mass-spec attends to de novo molecular generation, 2025. URL https://arxiv.org/abs/2501.01950.

Wu, M., Chen, X., and Liu, L.-P. EDGE++: Improved training and sampling of EDGE. arXiv preprint arXiv:2310.14441, 2023.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513-530, 2018a.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. MoleculeNet: A benchmark for molecular machine learning, 2018b. URL https://arxiv.org/abs/1703.00564.

Wu, Z., Jain, P., Wright, M., Mirhoseini, A., Gonzalez, J. E., and Stoica, I. Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems, 34:13266-13279, 2021.

Xu, Z., Qiu, R., Chen, Y., Chen, H., Fan, X., Pan, M., Zeng, Z., Das, M., and Tong, H. Discrete-state continuous-time diffusion for graph generation. arXiv preprint arXiv:2405.11416, 2024.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877-28888, 2021.

You, J., Liu, B., Ying, Z., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 31, 2018a.

You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. GraphRNN: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, pp. 5708-5717. PMLR, 2018b.

You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations, 2021. URL https://arxiv.org/abs/2010.13902.

Zang, C. and Wang, F.
MoFlow: An invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617-626, 2020.

Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B., Liu, Y., Jin, S., Liu, Q., Zhou, Y., et al. Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964, 2023.

Zhu, Y., Du, Y., Wang, Y., Xu, Y., Zhang, J., Liu, Q., and Wu, S. A survey on deep graph generation: Methods and applications. In Learning on Graphs Conference, pp. 47-1. PMLR, 2022.

Zhu, Y., Chen, D., Du, Y., Wang, Y., Liu, Q., and Wu, S. Molecular contrastive pretraining with collaborative featurizations. Journal of Chemical Information and Modeling, 64(4):1112-1122, 2024.

A. Reinforcement Learning Details

A.1. Preliminaries on Proximal Policy Optimization (PPO)

Generalized Advantage Estimation. In reinforcement learning, the Q function $Q(s_t, a_t)$ is the expected discounted return after taking action $a_t$ in state $s_t$, the value function $V(s_t)$ is its expectation over actions, and the advantage is $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Generalized Advantage Estimation (GAE) (Schulman et al., 2015) estimates the advantage as $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^{l}\,\delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, trading off bias and variance through $\lambda$.

Rejection sampling fine-tuning. For GSK3β, we use score thresholds {> 0.2, > 0.4, > 0.6, > 0.8}. We observe that the pre-trained model's score distribution is skewed towards 0, making it challenging to generate satisfactory samples. To resolve this, we fine-tune the model at the 0.2 threshold and progressively bootstrap it through intermediate thresholds (0.4, 0.6) up to 0.8, performing three bootstrapping steps in total. All models are trained for 6000 iterations, with a batch size of 120 and a learning rate of 1e-5. The learning rate gradually decays to 0 using a cosine scheduler.

Reinforcement learning. We use the PPO algorithm to further optimize the pre-trained model. In practice, the token-level reward $R(s_{<t}, s_t)$ assigns the task reward $r_z(s)$ at the final token and zero elsewhere.