Published as a conference paper at ICLR 2024

RETROBRIDGE: MODELING RETROSYNTHESIS WITH MARKOV BRIDGES

Ilia Igashov¹*, Arne Schneuing¹*, Marwin Segler², Michael Bronstein³ & Bruno Correia¹

¹École Polytechnique Fédérale de Lausanne, ²Microsoft Research, ³University of Oxford

{ilia.igashov,arne.schneuing,bruno.correia}@epfl.ch

*These authors contributed equally.

Retrosynthesis planning is a fundamental challenge in chemistry that aims at designing multi-step reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule, as well as confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed at approximating the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks.

1 INTRODUCTION

Computational and machine learning methods for de novo drug design show great promise as more cost-effective alternatives to experimental high-throughput screening approaches (Thomas et al., 2023) for proposing molecules with desirable properties. While in silico results suggest high predicted target binding affinities and other favorable properties of the generated molecules, limited emphasis has so far been placed on their synthesizability (Stanley & Segler, 2023). For laboratory testing, synthetic pathways need to be developed for the newly designed molecules, which is an extremely challenging and time-consuming task. Retrosynthesis planning (Corey, 1991; Strieth-Kalthoff et al., 2020; Tu et al., 2023) tools address this challenge by proposing reaction steps or entire pathways that can be validated and optimized in the lab.

Single-step retrosynthesis models predict precursor molecules for a given target molecule (Segler & Waller, 2017; Coley et al., 2017; Liu et al., 2017; Strieth-Kalthoff et al., 2020; Tu et al., 2023). Applying these methods recursively makes it possible to decompose the initial molecule into progressively simpler intermediates and eventually reach available starting molecules (Segler et al., 2018).

While most works have used a discriminative formulation for retrosynthesis modeling (Strieth-Kalthoff et al., 2020; Tu et al., 2023; Jiang et al., 2022), we propose to view the task as a conditional distribution learning problem, as shown in Figure 1. This approach has several advantages, including the ability to model uncertainty and to generate new and diverse retrosynthetic pathways. Furthermore, and most importantly, the probabilistic formulation reflects the fact that the same product molecule can often be synthesized with different sets of reactants and reagents.
Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) and other modern score-based and flow-based generative methods (Rezende & Mohamed, 2015; Song et al., 2020; Lipman et al., 2022; Albergo & Vanden-Eijnden, 2022; Albergo et al., 2023) may seem like good candidates for retrosynthesis modeling. However, as we show in this work, such models do not fit the formulation of the problem naturally, as they are designed to approximate a single intractable data distribution. To do so, one typically samples initial noise from a simple prior distribution and then maps it to a data point that follows a complex target distribution. In contrast, we aim to learn the dependency between two intractable distributions rather than one intractable distribution itself. While this can be achieved by conditioning the sampling process on the relevant context while still sampling from the noise prior, we show here that such a use of the originally unconditional generative idea is suboptimal for approximating the dependency between two discrete distributions.

Figure 1: Markov bridges between the distribution of products and the distribution of reactants.

In this work, we propose RetroBridge, a template-free probabilistic method for single-step retrosynthesis modeling. As shown in Figure 1, we model the dependency between the spaces of products and reactants as a stochastic process that is constrained to start and to end at specific data points. To this end, we introduce the Markov Bridge Model, a generative model that learns the dependency between two intractable discrete distributions through a finite sample of coupled data points. Taking a product molecule as input, our method models the trajectories of Markov bridges starting at the given product and ending at data points following the distribution of reactants. To score reactant graphs sampled in this way, we leverage the probabilistic nature of RetroBridge and measure its uncertainty at each sample.

We demonstrate that RetroBridge achieves competitive results on standard retrosynthesis modeling benchmarks. In addition, we compare RetroBridge with the state-of-the-art graph diffusion model DiGress (Vignac et al., 2022), and demonstrate quantitatively and qualitatively that the proposed Markov Bridge Model is better suited to tasks where two intractable discrete distributions need to be mapped. Our source code is available at https://github.com/igashov/RetroBridge.

To summarise, the main contributions of this work are the following:

- We introduce the Markov Bridge Model to approximate the probabilistic dependency between two intractable discrete distributions accessible via a finite sample of coupled data points.
- We demonstrate the superiority of the proposed formulation over diffusion models in the context of learning the dependency between two intractable discrete distributions.
- We propose RetroBridge, the first Markov Bridge Model for retrosynthesis modeling. RetroBridge is a template-free single-step retrosynthesis prediction method that achieves state-of-the-art results on standard benchmarks.
2 RELATED WORK

**Diffusion Models** Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) form a class of powerful and effective score-based generative methods that have recently achieved promising results in many different domains, including protein design (Watson et al., 2023), small molecule generation (Hoogeboom et al., 2022; Igashov et al., 2022; Schneuing et al., 2022), molecular docking (Corso et al., 2022), and sampling of transition state molecular structures (Duan et al., 2023; Kim et al., 2023). While most models are designed for the continuous data domain, a few methods have been proposed to operate on discrete data (Hoogeboom et al., 2021; Johnson et al., 2021; Austin et al., 2021; Yang et al., 2023) and, in particular, on discrete graphs (Vignac et al., 2022). To the best of our knowledge, however, no diffusion models have been applied to modeling chemical reactions and recovering retrosynthetic pathways.

**Schrödinger Bridge Problem** Given two distributions and a reference stochastic process between them, solving the Schrödinger bridge (SB) problem (Schrödinger, 1932; Léonard, 2013) amounts to finding a process closest to the reference in terms of Kullback-Leibler divergence on path spaces. While most recent methods employ the SB formalism in the context of unconditional generative modeling (Vargas et al., 2021; Wang et al., 2021; De Bortoli et al., 2021; Chen et al., 2021; Bunne et al., 2023; Liu et al., 2022), a few works have aimed to approximate the reference stochastic process through training on coupled samples from two continuous distributions (Holdijk et al., 2022; Somnath et al., 2023). To the best of our knowledge, there are no methods operating on categorical distributions, which is the subject of the present work.

**Retrosynthesis Modeling** Recent retrosynthesis prediction methods can be divided into two main groups: template-based and template-free methods (Jiang et al., 2022). While template-based methods depend on predefined sets of specific reaction templates or leaving groups, template-free methods are less restricted and are therefore able to explore new reaction pathways. Two common data representations used for retrosynthesis prediction are the symbolic SMILES representation (Weininger, 1988) and molecular graphs. A variety of language models have recently been proposed (Liu et al., 2017; Zheng et al., 2019; Sun et al., 2021; Tetko et al., 2020) to operate on SMILES, some of which (Jiang et al., 2023; Zhong et al., 2022) were additionally pre-trained on larger datasets. Due to the nature of the sequence-to-sequence translation problem, all these methods are template-free. Among the existing graph-based methods (Segler & Waller, 2017), the most recent template-based ones are GLN (Dai et al., 2019), GraphRetro (Somnath et al., 2021) and LocalRetro (Chen & Jung, 2021), and template-free approaches are G2G (Shi et al., 2020) and MEGAN (Sacha et al., 2021). The template-free methods GTA (Seo et al., 2021), Graph2SMILES (Tu & Coley, 2022) and Retroformer (Wan et al., 2022) leverage both graph and SMILES representations. In this work, we propose a novel template-free graph-based method.

3 RETROBRIDGE

We frame the retrosynthesis prediction task as a generative problem of modeling a stochastic process between two discrete-valued distributions of products $p_X$ and reactants $p_Y$.
These distributions are intractable and are represented by a finite collection of $D$ coupled samples $\{(x_i, y_i)\}_{i=1}^{D}$, where $x_i \sim p_X$ is a product molecule and $y_i \sim p_Y$ is a corresponding set of reactant molecules. While products and reactants follow the distributions $p_X$ and $p_Y$ respectively, there is a dependency between these variables that can be expressed in the form of a joint distribution $p_{X,Y}$ such that $\int p_{X,Y}(x, y)\,dx = p_Y(y)$ and $\int p_{X,Y}(x, y)\,dy = p_X(x)$. The joint distribution $p_{X,Y}$ is also intractable and accessible only through the discrete sample of coupled data points $\{(x_i, y_i)\}_{i=1}^{D}$.

First, we introduce the Markov Bridge Model, a general framework for learning the dependency between two intractable discrete-valued distributions. Next, we discuss a special case where the random variables are molecular graphs. Upon this formulation, we introduce RetroBridge, a Markov Bridge Model for single-step retrosynthesis modeling. Finally, we explain a simple but effective way of scoring RetroBridge samples based on the statistical uncertainty of the model.

3.1 MARKOV BRIDGE MODEL

We model the dependency between two discrete spaces $\mathcal{X}$ and $\mathcal{Y}$ by a Markov bridge (Fitzsimmons et al., 1992; Çetin & Danilova, 2016), which is a Markov process pinned to specific data points at its beginning and end. For a pair of samples $(x, y) \sim p_{X,Y}(x, y)$ and a sequence of time steps $t = 0, 1, \ldots, T$, we define the corresponding Markov bridge as a sequence of random variables $(z_t)_{t=0}^{T}$ that starts at $x$, i.e. $z_0 = x$, and satisfies the Markov property,

$$p(z_t \mid z_0, z_1, \ldots, z_{t-1}, y) = p(z_t \mid z_{t-1}, y). \quad (1)$$

To pin the process at the data point $y$, we introduce an additional requirement,

$$p(z_T = y \mid z_{T-1}, y) = 1. \quad (2)$$

Figure 2: The process of changing atom types along the trajectory of the Markov bridge. The trajectory starts at time step t = 0 with the product molecule and several disconnected dummy atoms that will be included in the final reactant molecule. The probability of sampling the target atom type increases as t grows. Five circles filled with different colors represent these probabilities. To keep the illustration compact, we omit part of the product molecule and one of the two reactant molecules.

Assuming that both distributions $p_X$ and $p_Y$ are categorical with a finite sample space $\{1, \ldots, K\}$, we can represent data points as $K$-dimensional one-hot vectors: $x, y, z_t \in \mathbb{R}^K$. To model a Markov bridge defined by equations (1)–(2), similar to Austin et al. (2021), we introduce a sequence of transition matrices $Q_0, Q_1, \ldots, Q_{T-1} \in \mathbb{R}^{K \times K}$, defined as

$$Q_t := Q_t(y) = \alpha_t I_K + (1 - \alpha_t)\, y \mathbf{1}_K^\top, \quad (3)$$

where $I_K$ is a $K \times K$ identity matrix, $\mathbf{1}_K$ is a $K$-dimensional all-one vector, and $\alpha_t$ is a schedule parameter transitioning from $\alpha_0 = 1$ to $\alpha_{T-1} = 0$. The transition probabilities (1) can then be written as

$$p(z_{t+1} \mid z_t, y) = \mathrm{Cat}(z_{t+1};\, Q_t z_t), \quad (4)$$

where $\mathrm{Cat}(\cdot\,; p)$ is a categorical distribution with probabilities given by $p$. We note that setting $\alpha_{T-1} = 0$ ensures the requirement (2).

Using the finite set of coupled samples $\{(x_i, y_i)\}_{i=1}^{D} \sim p_{X,Y}$, our goal is to learn a Markov bridge (1)–(2) so that we can sample $y$ when only $x$ is available. To do this, we replace $y$ with an approximation $\hat{y}$ computed with a neural network $\varphi_\theta$:

$$\hat{y} = \varphi_\theta(z_t, t), \quad (5)$$

and define an approximated transition kernel,

$$q_\theta(z_{t+1} \mid z_t) = \mathrm{Cat}(z_{t+1};\, Q_t(\hat{y})\, z_t). \quad (6)$$

We train $\varphi_\theta$ by maximizing a lower bound of the log-likelihood $\log q_\theta(y \mid x)$.
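To make the construction above concrete, here is a minimal NumPy sketch of the transition kernel in equations (3)–(4) for a single categorical variable; the toy dimensions, the linear schedule, and the helper names are illustrative assumptions only (the actual model uses the cosine schedule of Appendix A.3.1 and graph-factorized transitions, equations (10)–(11)):

```python
import numpy as np

def transition_matrix(y: np.ndarray, alpha_t: float) -> np.ndarray:
    """Q_t(y) = alpha_t * I_K + (1 - alpha_t) * y 1_K^T  (equation 3)."""
    K = y.shape[0]
    return alpha_t * np.eye(K) + (1.0 - alpha_t) * np.outer(y, np.ones(K))

def bridge_step(z_t, y, alpha_t, rng):
    """Draw z_{t+1} ~ Cat(z_{t+1}; Q_t(y) z_t)  (equation 4)."""
    probs = transition_matrix(y, alpha_t) @ z_t  # mixes z_t with the endpoint y
    k = rng.choice(len(probs), p=probs)
    return np.eye(len(probs))[k]                 # one-hot next state

# Toy run: K = 4 categories, T = 10 steps, linear schedule (illustrative).
rng = np.random.default_rng(0)
K, T = 4, 10
x, y = np.eye(K)[0], np.eye(K)[2]     # pinned endpoints: z_0 = x, z_T = y
alphas = np.linspace(1.0, 0.0, T)     # alpha_0 = 1, ..., alpha_{T-1} = 0
z = x
for t in range(T):
    z = bridge_step(z, y, alphas[t], rng)
assert np.allclose(z, y)  # alpha_{T-1} = 0 pins the trajectory at y (eq. 2)
```

At inference time, the true endpoint $y$ passed to `bridge_step` would be replaced by the network prediction $\hat{y} = \varphi_\theta(z_t, t)$ of equation (5), yielding exactly the approximated kernel $q_\theta$ of equation (6).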
As shown in Appendix A.1, this lower bound has the following closed-form expression,

$$\log q_\theta(y \mid x) \geq -T\, \mathbb{E}_{t \sim U(0, \ldots, T-1)}\, \mathbb{E}_{z_t \sim p(z_t \mid x, y)}\, D_{\mathrm{KL}}\big(p(z_{t+1} \mid z_t, y) \,\|\, q_\theta(z_{t+1} \mid z_t)\big). \quad (7)$$

For any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $t = 1, \ldots, T$, sampling of $z_t$ can be performed efficiently using a cumulative product matrix $\bar{Q}_t = Q_t Q_{t-1} \cdots Q_0$. As shown in Appendix A.2, the cumulative matrix $\bar{Q}_t$ can be written in closed form,

$$\bar{Q}_t = \bar{\alpha}_t I_K + (1 - \bar{\alpha}_t)\, y \mathbf{1}_K^\top, \quad (8)$$

where $\bar{\alpha}_t = \prod_{s=0}^{t} \alpha_s$. Therefore, $p(z_{t+1} \mid z_0, z_T)$ can be written as follows,

$$p(z_{t+1} \mid z_0, z_T) = \mathrm{Cat}\big(z_{t+1};\, \bar{Q}_t z_0\big). \quad (9)$$

To sample a data point $y = z_T$ starting from a given $z_0 = x \sim p_X(x)$, one iteratively predicts $\hat{y} = \varphi_\theta(z_t, t)$ and then draws $z_{t+1} \sim q_\theta(z_{t+1} \mid z_t) = \mathrm{Cat}(z_{t+1};\, Q_t(\hat{y})\, z_t)$ for $t = 0, \ldots, T-1$. Training and sampling procedures of the Markov Bridge Model are provided in Algorithms 1 and 2, respectively.

**Algorithm 1: Training of the Markov Bridge Model**
Input: coupled sample $(x, y) \sim p_{X,Y}$, neural network $\varphi_\theta$
1. $t \sim U(0, \ldots, T-1)$, $z_t \sim \mathrm{Cat}(z_t;\, \bar{Q}_{t-1} x)$ — sample time step and intermediate state
2. $\hat{y} \leftarrow \varphi_\theta(z_t, t)$ — the output of $\varphi_\theta$ is a vector of probabilities
3. $p(z_{t+1} \mid z_t, y) = \mathrm{Cat}(z_{t+1};\, Q_t(y)\, z_t)$ — reference transition distribution
4. $q_\theta(z_{t+1} \mid z_t) = \mathrm{Cat}(z_{t+1};\, Q_t(\hat{y})\, z_t)$ — approximated transition distribution
5. Minimize $D_{\mathrm{KL}}\big(p(z_{t+1} \mid z_t, y) \,\|\, q_\theta(z_{t+1} \mid z_t)\big)$

**Algorithm 2: Sampling**
Input: starting point $x \sim p_X$, neural network $\varphi_\theta$
1. $z_0 \leftarrow x$
2. for $t$ in $0, \ldots, T-1$:
3.   $\hat{y} \leftarrow \varphi_\theta(z_t, t)$ — the output of $\varphi_\theta$ is a vector of probabilities
4.   $q_\theta(z_{t+1} \mid z_t) = \mathrm{Cat}(z_{t+1};\, Q_t(\hat{y})\, z_t)$ — approximated transition distribution
5.   $z_{t+1} \sim q_\theta(z_{t+1} \mid z_t)$
6. Return $z_T$

3.2 RETROBRIDGE: MARKOV BRIDGE MODEL FOR RETROSYNTHESIS PLANNING

In our setup, each data point is a molecular graph with nodes representing atoms and edges corresponding to covalent bonds. We represent the molecular graph with a matrix of node features $H \in \mathbb{R}^{N \times K_a}$, which are, for instance, one-hot encoded atom types, and a tensor of edge features $E \in \mathbb{R}^{N \times N \times K_e}$, which can be one-hot encoded bond types. In the scope of our probabilistic framework, we consider such a molecular graph representation as a collection of independent categorical random variables. More formally, we denote product and reactant data points $x$ and $y$ as tuples of the corresponding node and edge feature tensors: $x = [H^x, E^x]$ and $y = [H^y, E^y]$. For such complex data points, we modify the definitions of the transition matrices and probabilities accordingly:

$$[Q_t^H]_j = \alpha_t I_{K_a} + (1 - \alpha_t)\, \mathbf{1}_{K_a} [H^y]_j, \qquad p(H_{t+1} \mid H_t, H^y) = \mathrm{Cat}\big(H_{t+1};\, H_t Q_t^H\big), \quad (10)$$

$$[Q_t^E]_{j,k} = \alpha_t I_{K_e} + (1 - \alpha_t)\, \mathbf{1}_{K_e} [E^y]_{j,k}, \qquad p(E_{t+1} \mid E_t, E^y) = \mathrm{Cat}\big(E_{t+1};\, E_t Q_t^E\big), \quad (11)$$

where $[H]_j \in \mathbb{R}^{1 \times K_a}$ is the $j$-th row of the feature matrix $H$ (i.e. the transposed feature vector of the $j$-th node), and $[E]_{j,k} \in \mathbb{R}^{1 \times K_e}$ is the transposed feature vector of the edge between the $j$-th and $k$-th nodes.

Because some atoms present in the reactant molecules can be absent from the corresponding product molecule, we add dummy nodes to the initial graph of the product. As shown in Figure 2, some dummy nodes are transformed into atoms of the reactant molecules. In our experiments, we always add 10 dummy nodes to the initial product graphs.

3.3 CONFIDENCE AND SCORING

It is important to have a reliable scoring method that selects the most relevant sets of reactants out of all generated samples. To rank RetroBridge samples, we benefit from the probabilistic nature of the model and use its confidence in the generated samples as a scoring function. We estimate the confidence of the model by computing the likelihood $q_\theta(y \mid x)$ of a set of reactants $y$ for an input product molecule $x$.
For a set of $M$ samples $\{z_T^{(i)}\}_{i=1}^{M}$ generated by RetroBridge for an input product $x$, we compute the likelihood-based confidence score for a set of reactants $y$ as follows,

$$q_\theta(y \mid x) = \mathbb{E}_{y' \sim q_\theta(\cdot \mid x)}\, \mathbb{1}\{y' = y\} \approx \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\{z_T^{(i)} = y\}. \quad (12)$$

4 EXPERIMENTS

4.1 EXPERIMENTAL SETUP

**Dataset** For all experiments we use the USPTO-50k dataset (Schneider et al., 2016), which includes 50k reactions found in the US patent literature. We use the standard train/validation/test splits provided by Dai et al. (2019). Somnath et al. (2021) report that the dataset contains a shortcut: the product atom with atom-mapping 1 is part of the edit in almost 75% of the cases. Even though our model does not depend on the order of graph nodes, we use the dataset version with canonical SMILES provided by Somnath et al. (2021). In addition, we randomly permute graph nodes once SMILES are read and converted to graphs.

**Baselines** We compare RetroBridge with the template-based methods GLN (Dai et al., 2019), LocalRetro (Chen & Jung, 2021) and GraphRetro (Somnath et al., 2021), and the template-free methods MEGAN (Sacha et al., 2021), G2G (Shi et al., 2020), Augmented Transformer (Tetko et al., 2020), SCROP (Zheng et al., 2019), Tied Transformer (Kim et al., 2021), GTA (Seo et al., 2021), Dual-TF (Sun et al., 2021), Graph2SMILES (Tu & Coley, 2022) and Retroformer (Wan et al., 2022). We obtained GLN predictions using the publicly available code and model weights¹ and used the latest LocalRetro predictions provided by its authors. Additionally, MEGAN was originally trained and evaluated on random data splits, so we retrained and evaluated it ourselves. Finally, we compare RetroBridge with the state-of-the-art discrete graph diffusion model DiGress (Vignac et al., 2022) and a template-free baseline based on a graph transformer architecture (Dwivedi & Bresson, 2020; Vignac et al., 2022).

**Evaluation** For each input product, we sample 100 reactant sets and report top-k exact match accuracy (k = 1, 3, 5, 10), measured as the proportion of input products for which the method produced the correct set of reactants among its top-k samples. Subsequently, for the top-k samples produced for every input product, we run the forward reaction prediction model Molecular Transformer (Schwaller et al., 2019) and report round-trip accuracy and coverage (Schwaller et al., 2020). Round-trip accuracy is the percentage of correctly predicted reactants among all predictions, where reactants are considered correct either if they match the ground truth or if they lead back to the input product. Round-trip coverage, on the other hand, measures whether there is at least one correct prediction in the top-k according to the definition above. These metrics reflect the fact that one product can be mapped to multiple different valid sets of reactants, as shown in Figure 1.

4.2 NEURAL NETWORK

We use a graph transformer network (Dwivedi & Bresson, 2020; Vignac et al., 2022) to approximate the final state of the Markov bridge process. We represent molecules as fully-connected graphs where node features are one-hot encoded atom types (sixteen atom types and an additional dummy type) and edge features are covalent bond types (three bond types and an additional 'none' type). Similarly to Vignac et al. (2022), we compute several graph-related node and global features that include the number of cycles and spectral graph features. Details on the network architecture, hyperparameters and training process are provided in Appendix A.3.
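To make the scoring rule (12) and the top-k exact match evaluation of Section 4.1 concrete, the following is a minimal sketch in which canonical SMILES strings stand in for sampled reactant graphs; the helper names and toy values are ours and not part of the released implementation:

```python
from collections import Counter

def rank_by_confidence(samples: list) -> list:
    """Deduplicate sampled reactant sets and score each unique set by its
    sampling frequency, the Monte Carlo estimate of q_theta(y|x) in (12)."""
    counts = Counter(samples)
    M = len(samples)
    ranked = [(y, c / M) for y, c in counts.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

def top_k_hit(samples: list, ground_truth: str, k: int) -> bool:
    """Exact-match top-k: is the true reactant set among the k most
    confident deduplicated predictions?"""
    return ground_truth in [y for y, _ in rank_by_confidence(samples)[:k]]

# Toy usage: 100 samples drawn for one product, as canonical SMILES strings.
samples = ["CCO.CC(=O)O"] * 66 + ["CCO.CC(=O)Cl"] * 22 + ["CC(=O)OCC"] * 12
print(rank_by_confidence(samples)[:2])          # [('CCO.CC(=O)O', 0.66), ('CCO.CC(=O)Cl', 0.22)]
print(top_k_hit(samples, "CCO.CC(=O)Cl", k=1))  # False
print(top_k_hit(samples, "CCO.CC(=O)Cl", k=3))  # True
```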
4.3 RETROSYNTHESIS MODELING

Here we report top-k and round-trip accuracy for RetroBridge and other state-of-the-art methods on the USPTO-50k test set. Table 1 provides exact match accuracy results, and Table 2 reports round-trip accuracy computed using Molecular Transformer (Schwaller et al., 2019).

For completeness, we compare the exact match accuracy of RetroBridge with both template-free and template-based methods. However, template-free modeling is a more challenging task, and RetroBridge is template-free, so we primarily focus on the latter group of methods. As shown in Table 1, RetroBridge performs on par with the state of the art in top-1 accuracy and, most importantly, outperforms the other template-free methods, in particular highly optimized transformer-based models, in all other top-k accuracy metrics. We emphasize that top-5 accuracy is the most relevant metric, as this number is the closest to the typically expected or desired breadth of the multi-step retrosynthesis planning tree (Maziarz et al., 2023). Hence, as shown in Table 3, our model was optimized for top-5 accuracy. Additional baseline methods and more information on their technical aspects are provided in Table 4.

¹Note that the top-k exact match accuracies differ from the originally reported values because we deduplicate outputs for our evaluation.

Table 1: Top-k accuracy (exact match) on the USPTO-50k test dataset. The best performing template-based (TB) and template-free (TF) methods are highlighted in bold.

| Model | Type | k = 1 | k = 3 | k = 5 | k = 10 |
|---|---|---|---|---|---|
| GLN (Dai et al., 2019) | TB | 52.5 | 74.7 | 81.2 | 87.9 |
| GraphRetro (Somnath et al., 2021) | TB | **53.7** | 68.3 | 72.2 | 75.5 |
| LocalRetro (Chen & Jung, 2021) | TB | 52.6 | **76.0** | **84.4** | **90.6** |
| SCROP (Zheng et al., 2019) | TF | 43.7 | 60.0 | 65.2 | 68.7 |
| G2G (Shi et al., 2020) | TF | 48.9 | 67.6 | 72.5 | 75.5 |
| Aug. Transformer (Tetko et al., 2020) | TF | 48.3 | – | 73.4 | 77.4 |
| Dual-TF-aug (Sun et al., 2021) | TF | **53.6** | 70.7 | 74.6 | 77.0 |
| MEGAN (Sacha et al., 2021) | TF | 48.0 | 70.9 | 78.1 | 85.4 |
| Tied Transformer (Kim et al., 2021) | TF | 47.1 | 67.1 | 73.1 | 76.3 |
| GTA-aug (Seo et al., 2021) | TF | 51.1 | 67.6 | 74.8 | 81.6 |
| Graph2SMILES (Tu & Coley, 2022) | TF | 52.9 | 66.5 | 70.0 | 72.9 |
| Retroformer-aug (Wan et al., 2022) | TF | 52.9 | 68.2 | 72.5 | 76.4 |
| RetroBridge (ours) | TF | 50.8 | **74.1** | **80.6** | **85.6** |

Table 2: Top-k round-trip coverage and accuracy on the USPTO-50k test dataset. The best performing template-based (TB) and template-free (TF) methods are highlighted in bold.

| Model | Type | Cov. k = 1 | Cov. k = 3 | Cov. k = 5 | Acc. k = 1 | Acc. k = 3 | Acc. k = 5 |
|---|---|---|---|---|---|---|---|
| GLN (Dai et al., 2019) | TB | **82.5** | 92.0 | 94.0 | **82.5** | **71.0** | 66.2 |
| LocalRetro (Chen & Jung, 2021) | TB | 82.1 | **92.3** | **94.7** | 82.1 | **71.0** | **66.7** |
| MEGAN (Sacha et al., 2021) | TF | 78.1 | 88.6 | 91.3 | 78.1 | 67.3 | 61.7 |
| Graph2SMILES (Tu & Coley, 2022) | TF | – | – | – | 76.7 | 56.0 | 46.4 |
| Retroformer-aug (Wan et al., 2022) | TF | – | – | – | 78.6 | 71.8 | 67.1 |
| RetroBridge (ours) | TF | **85.1** | **95.7** | **97.1** | **85.1** | **73.6** | **67.8** |

Because exact match accuracy cannot reflect the complete picture of dependencies between the spaces of products and reactants, we additionally measure round-trip performance using an orthogonal method for forward reaction prediction. For comparison, we evaluate the predictions of the template-based methods GLN (Dai et al., 2019) and LocalRetro (Chen & Jung, 2021), the template-free methods Graph2SMILES (Tu & Coley, 2022) and Retroformer (Wan et al., 2022), as well as our retrained version of the template-free MEGAN (Sacha et al., 2021). The results are reported in Table 2.
RetroBridge clearly outperforms the template-free baselines and, despite the much higher difficulty of the template-free setup, even achieves higher round-trip coverage and accuracy values than the state-of-the-art template-based methods. These results support our hypothesis that retrosynthesis should be modeled in a probabilistic framework, given the absence of a unique set of reactants for a given initial product molecule.

4.4 ADDITIONAL EXPERIMENTS

In this section, we compare RetroBridge with a naïve adaptation of DiGress (Vignac et al., 2022) for retrosynthesis prediction, which can be considered the most comparable diffusion-based method for this task, and with a graph transformer network that predicts reactants in a one-shot fashion. Furthermore, we study the effect of adding the input product molecule as context to the neural network $\varphi_\theta$ at each sampling step. More precisely, for models with no context we use the formulation (5), while models with context compute predictions as $\hat{y} = \varphi_\theta(z_t, x, t)$. Finally, we try a simpler cross-entropy (CE) loss function, as used in DiGress, that directly compares the approximated reactants with the ground truth,

$$L_{\mathrm{CE}}(\theta) = T\, \mathbb{E}_{t \sim U(0, \ldots, T-1)}\, \mathbb{E}_{z_t, z_T \sim p}\, \mathrm{CrossEntropy}\big(z_T, \varphi_\theta(z_t, t)\big). \quad (13)$$

In all experiments we use the same neural network architectures and sets of hyperparameters. We perform our evaluation on the USPTO-50k validation set and report the top-k accuracy of the generated samples in Table 3.

Table 3: Additional experiments: top-k accuracy on the USPTO-50k validation set.

| Model | k = 1 | k = 3 | k = 5 | k = 10 | k = 50 |
|---|---|---|---|---|---|
| DiGress (context) | 47.32 | 68.56 | 73.93 | 78.45 | 80.88 |
| RetroBridge-CE (no context) | 48.71 | 66.84 | 72.33 | 76.08 | 79.38 |
| RetroBridge-CE (context) | 50.74 | 71.50 | 76.58 | 79.50 | 80.58 |
| RetroBridge-VLB (no context) | 47.42 | 69.46 | 75.21 | 79.40 | 83.82 |
| RetroBridge-VLB (context) | 48.92 | 73.04 | 79.44 | 83.74 | 86.31 |

First of all, we observe that iterative sampling, as performed by diffusion models or the Markov Bridge Model, is essential for solving the problem of mapping between two graph distributions. Our one-shot graph transformer model, trained for a comparable amount of time, does not manage to recover any of the reactants. Indeed, this is an extremely challenging task, as even a single incorrectly predicted bond or atom type is detrimental to the accuracy metric. To the best of our knowledge, all one-shot graph-based models proposed for retrosynthesis prediction fall into the category of template-based methods. In this case, the networks do not predict the entire set of reactants right away, but instead aim to solve much simpler tasks such as the prediction of graph edits. To obtain the final set of reactants, the graph edit predictions are further processed by an additional block that typically relies on predefined reaction templates or leaving group dictionaries.

Next, we compare RetroBridge with DiGress to demonstrate that the Markov bridge formulation is more suitable than diffusion models when two discrete graph distributions are to be mapped. As shown in Table 3, RetroBridge outperforms DiGress with context in all metrics. We note that if we do not pass the context, DiGress predictably does not manage to recover any reactants. This result illustrates that the Markov bridge framework captures the underlying structure of the task much more naturally than diffusion models.
A diffusion model maps sampled noise to the reactants, having access to the input product molecule only through the additional context, while the Markov bridge model starts each sampling trajectory with a product molecule from the intractable distribution $p_X(x) = \int p_{X,Y}(x, y)\,dy$. Finally, we demonstrate that the variational lower bound loss (7) works better than the simplified cross-entropy loss (13) proposed by Vignac et al. (2022). Ultimately, we find it beneficial to include the input product as context at each sampling step. Unlike the diffusion approach, however, RetroBridge achieves reasonable accuracy values even without additional context, as large parts of the product structure are retained throughout the sampling trajectory.

4.5 EXAMPLES

Figure 3 shows three examples of reactions randomly selected from the USPTO-50k test set. For each of these examples, we provide the top-3 RetroBridge samples and the corresponding confidence scores. In all three cases our model managed to recover the correct set of reactants. In the first case, RetroBridge predicts the correct reactants with high confidence; the gap between the scores of the first and the second prediction is remarkably high (0.66 vs. 0.12). In the two other cases, the model is not as confident in the answer. This uncertainty is reflected in the scores: the top-1 samples (which are not correct) have scores of 0.18 and 0.38, respectively (cf. 0.17 and 0.2 for the correct ones). More examples are provided in Figure 6.

Figure 3: Examples of modeled reactants. We selected three random inputs from the test set and for each of them we provide the top-3 RetroBridge predictions along with their confidence scores. Two check marks indicate that the sampled reactants are the same as the ground truth, and one check mark means that the reactants are different, but Molecular Transformer (Schwaller et al., 2019) predicts the product molecule used as input.

5 CONCLUSION

In this work, we introduce the Markov Bridge Model, a new generative framework for tasks that involve learning dependencies between two intractable discrete-valued distributions. We furthermore apply the new methodology to the retrosynthesis prediction problem, an important challenge in medicinal chemistry and drug discovery. Our template-free method, RetroBridge, achieves state-of-the-art results on common evaluation benchmarks. Importantly, our experiments show that choosing a suitable probabilistic modeling framework positively affects the performance on this task compared to a straightforward adaptation of diffusion models.

To make RetroBridge more applicable in practice, our future work aims to address several limitations present in RetroBridge that are also common to other similar retrosynthesis modeling methods. First, despite its constraints, template-based planning might still be preferred by many chemists, as it is more likely to adhere to established sets of reactants, reagents and reaction types. Therefore, guiding the generation process towards a specific set of compounds is a potential improvement that would make our method more useful in practice. Second, chemical reactions are heavily dependent on experimental conditions and additional reagents. Like most related methods, RetroBridge does not directly predict such conditions, nor does it provide explicit information about the required reaction type, but it could be adapted towards this setup.
While suggesting entire lab protocols seems out of reach with current methods, conditioning the generation process on this additional context could already make our method more applicable to real-world scenarios. Finally, most pharmaceutically relevant reaction pathways consist of several steps. Therefore, future work will also assess RetroBridge's performance in a multi-step setting by combining it with existing multi-step planning algorithms.

While this work focuses on the retrosynthesis modeling task, we note that the application of Markov Bridge Models is not limited to this problem. The proposed framework can be used in many other settings where two discrete distributions accessible via a finite sample of coupled data points need to be mapped. Such applications include but are not limited to image-to-image translation, inpainting, text translation and the design of protein binders. We leave the exploration of Markov Bridge Models in the scope of these and other possible challenges for future work.

ACKNOWLEDGMENTS

We thank Max Welling, Philippe Schwaller, Rebecca Neeser, Clément Vignac and Anar Rzayev for helpful feedback and insightful discussions. Ilia Igashov has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 945363. Arne Schneuing is supported by Microsoft Research AI4Science.

REFERENCES

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

Charlotte Bunne, Ya-Ping Hsieh, Marco Cuturi, and Andreas Krause. The Schrödinger bridge between Gaussian measures has a closed form. In International Conference on Artificial Intelligence and Statistics, pp. 5802–5833. PMLR, 2023.

Umut Çetin and Albina Danilova. Markov bridges: SDE representation. Stochastic Processes and their Applications, 126(3):651–679, 2016.

Shuan Chen and Yousung Jung. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au, 1(10):1612–1620, 2021.

Tianrong Chen, Guan-Horng Liu, and Evangelos A Theodorou. Likelihood training of Schrödinger bridge using forward-backward SDEs theory. arXiv preprint arXiv:2110.11291, 2021.

Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count substructures? Advances in Neural Information Processing Systems, 33:10383–10395, 2020.

Connor W Coley, Luke Rogers, William H Green, and Klavs F Jensen. Computer-assisted retrosynthesis based on molecular similarity. ACS Central Science, 3(12):1237–1245, 2017.

Elias James Corey. The Logic of Chemical Synthesis. 1991.

Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. DiffDock: Diffusion steps, twists, and turns for molecular docking. arXiv preprint arXiv:2210.01776, 2022.

Hanjun Dai, Chengtao Li, Connor Coley, Bo Dai, and Le Song. Retrosynthesis prediction with conditional graph logic network. Advances in Neural Information Processing Systems, 32, 2019.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet.
Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.

Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J Kulik. Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model. arXiv preprint arXiv:2304.06174, 2023.

Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.

Pat Fitzsimmons, Jim Pitman, and Marc Yor. Markovian bridges: construction, Palm interpretation, and splicing. In Seminar on Stochastic Processes, 1992, pp. 101–134. Springer, 1992.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Lars Holdijk, Yuanqi Du, Ferry Hooft, Priyank Jaini, Bernd Ensing, and Max Welling. Path integral stochastic optimal control for sampling transition paths. arXiv preprint arXiv:2207.02149, 2022.

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021.

Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, pp. 8867–8887. PMLR, 2022.

Ilia Igashov, Hannes Stärk, Clément Vignac, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3D-conditional diffusion models for molecular linker design. arXiv preprint arXiv:2210.05274, 2022.

Yinjie Jiang, Yemin Yu, Ming Kong, Yu Mei, Luotian Yuan, Zhengxing Huang, Kun Kuang, Zhihua Wang, Huaxiu Yao, James Zou, et al. Artificial intelligence for retrosynthesis prediction. Engineering, 2022.

Yinjie Jiang, Ying Wei, Fei Wu, Zhengxing Huang, Kun Kuang, and Zhihua Wang. Learning chemical rules of retrosynthesis with pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5113–5121, 2023.

Daniel D Johnson, Jacob Austin, Rianne van den Berg, and Daniel Tarlow. Beyond in-place corruption: Insertion and deletion in denoising probabilistic models. arXiv preprint arXiv:2107.07675, 2021.

Eunji Kim, Dongseon Lee, Youngchun Kwon, Min Sik Park, and Youn-Suk Choi. Valid, plausible, and diverse retrosynthesis using tied two-way transformers with latent variables. Journal of Chemical Information and Modeling, 61(1):123–133, 2021.

Seonghwan Kim, Jeheon Woo, and Woo Youn Kim. Diffusion-based generative AI for exploring transition states from 2D molecular graphs. 2023.

Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. arXiv preprint arXiv:1308.0215, 2013.

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.

Guan-Horng Liu, Tianrong Chen, Oswin So, and Evangelos Theodorou. Deep generalized Schrödinger bridge. Advances in Neural Information Processing Systems, 35:9374–9388, 2022.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Krzysztof Maziarz, Austin Tripp, Guoqing Liu, Megan Stanley, Shufang Xie, Piotr Gaiński, Philipp Seidl, and Marwin Segler. Re-evaluating retrosynthesis algorithms with Syntheseus. arXiv preprint arXiv:2310.19796, 2023.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538. PMLR, 2015.

Mikołaj Sacha, Mikołaj Błaż, Piotr Byrski, Paweł Dąbrowski-Tumański, Mikołaj Chromiński, Rafał Loska, Paweł Włodarczyk-Pruszyński, and Stanisław Jastrzębski. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Journal of Chemical Information and Modeling, 61(7):3273–3284, 2021.

Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What's what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016.

Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Liò, Carla Gomes, Max Welling, et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695, 2022.

Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. In Annales de l'institut Henri Poincaré, volume 2, pp. 269–310, 1932.

Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Science, 5(9):1572–1583, 2019.

Philippe Schwaller, Riccardo Petraglia, Valerio Zullo, Vishnu H Nair, Rico Andreas Haeuselmann, Riccardo Pisoni, Costas Bekas, Anna Iuliano, and Teodoro Laino. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chemical Science, 11(12):3316–3325, 2020.

Marwin HS Segler and Mark P Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal, 23(25):5966–5971, 2017.

Marwin HS Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604–610, 2018.

Seung-Woo Seo, You Young Song, June Yong Yang, Seohui Bae, Hankook Lee, Jinwoo Shin, Sung Ju Hwang, and Eunho Yang. GTA: Graph truncated attention for retrosynthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 531–539, 2021.

Chence Shi, Minkai Xu, Hongyu Guo, Ming Zhang, and Jian Tang. A graph to graphs framework for retrosynthesis prediction. In International Conference on Machine Learning, pp. 8818–8827. PMLR, 2020.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Vignesh Ram Somnath, Charlotte Bunne, Connor Coley, Andreas Krause, and Regina Barzilay.
Learning graph models for retrosynthesis prediction. Advances in Neural Information Processing Systems, 34:9405–9415, 2021.

Vignesh Ram Somnath, Matteo Pariset, Ya-Ping Hsieh, Maria Rodriguez Martinez, Andreas Krause, and Charlotte Bunne. Aligned diffusion Schrödinger bridges. arXiv preprint arXiv:2302.11419, 2023.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Megan Stanley and Marwin Segler. Fake it until you make it? Generative de novo design and virtual screening of synthesizable molecules. Current Opinion in Structural Biology, 82:102658, 2023.

Felix Strieth-Kalthoff, Frederik Sandfort, Marwin HS Segler, and Frank Glorius. Machine learning the ropes: principles, applications and directions in synthetic chemistry. Chemical Society Reviews, 49(17):6154–6168, 2020.

Ruoxi Sun, Hanjun Dai, Li Li, Steven Kearnes, and Bo Dai. Towards understanding retrosynthesis by energy-based models. Advances in Neural Information Processing Systems, 34:10186–10194, 2021.

Igor V Tetko, Pavel Karpov, Ruud Van Deursen, and Guillaume Godin. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nature Communications, 11(1):5575, 2020.

Morgan Thomas, Andreas Bender, and Chris de Graaf. Integrating structure-based approaches in generative molecular design. Current Opinion in Structural Biology, 79:102559, 2023.

Zhengkai Tu and Connor W Coley. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. Journal of Chemical Information and Modeling, 62(15):3503–3513, 2022.

Zhengkai Tu, Thijs Stuyver, and Connor W Coley. Predictive chemistry: machine learning for reaction deployment, reaction development, and reaction discovery. Chemical Science, 2023.

Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving Schrödinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021.

Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.

Yue Wan, Chang-Yu Hsieh, Ben Liao, and Shengyu Zhang. Retroformer: Pushing the limits of end-to-end retrosynthesis transformer. In International Conference on Machine Learning, pp. 22475–22490. PMLR, 2022.

Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via Schrödinger bridge. In International Conference on Machine Learning, pp. 10794–10804. PMLR, 2021.

Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, pp. 1–3, 2023.

David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.

Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

Shuangjia Zheng, Jiahua Rao, Zhongyue Zhang, Jun Xu, and Yuedong Yang. Predicting retrosynthetic reactions using self-corrected transformer neural networks.
Journal of Chemical Information and Modeling, 60(1):47–55, 2019.

Zipeng Zhong, Jie Song, Zunlei Feng, Tiantao Liu, Lingxiang Jia, Shaolun Yao, Min Wu, Tingjun Hou, and Mingli Song. Root-aligned SMILES: a tight representation for chemical reaction prediction. Chemical Science, 13(31):9023–9034, 2022.

A.1 VARIATIONAL LOWER BOUND

The log-likelihood of reactants $y = z_T$ given product $x = z_0$ can be written as follows,

$$\log q_\theta(y \mid x) = \log q_\theta(z_T \mid z_0) \quad (14)$$
$$= \log \int dz_{1:T-1}\, q_\theta(z_{1:T} \mid z_0) \quad (15)$$
$$= \log \int dz_{1:T-1} \prod_{t=1}^{T} q_\theta(z_t \mid z_{t-1}) \quad (16)$$
$$= \log \int dz_{1:T-1}\, \frac{p(z_{1:T} \mid z_0, z_T)}{p(z_{1:T} \mid z_0, z_T)} \prod_{t=1}^{T} q_\theta(z_t \mid z_{t-1}) \quad (17)$$
$$= \log \int dz_{1:T-1}\, p(z_{1:T} \mid z_0, z_T) \prod_{t=1}^{T} \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)}. \quad (18)$$

Using Jensen's inequality (JI) and the fact that $p(z_{1:T} \mid z_0, z_T) = p(z_{1:T-1} \mid z_0, z_T)$ (∗), we can derive a lower bound of this log-likelihood,

$$\log q_\theta(y \mid x) \overset{\mathrm{JI}}{\geq} \int dz_{1:T-1}\, p(z_{1:T} \mid z_0, z_T) \sum_{t=1}^{T} \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (19)$$
$$\overset{(∗)}{=} \int dz_{1:T-1}\, p(z_{1:T-1} \mid z_0, z_T) \sum_{t=1}^{T} \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (20)$$
$$= \sum_{t=1}^{T} \int dz_{1:T-1}\, p(z_{1:T-1} \mid z_0, z_T) \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (21)$$
$$= L_1(z_0, z_T) + \sum_{t=2}^{T} L_t(z_0, z_T). \quad (22)$$

Here, the first term $L_1$ can be written as follows,

$$L_1(z_0, z_T) = \int dz_1\, p(z_1 \mid z_0, z_T) \log \frac{q_\theta(z_1 \mid z_0)}{p(z_1 \mid z_0, z_T)} = -D_{\mathrm{KL}}\big(p(z_1 \mid z_0, z_T) \,\|\, q_\theta(z_1 \mid z_0)\big). \quad (23)$$

Using Bayes' rule (BR) and the Markov property (MP) of the Markov bridge $p$, we can derive a similar expression for all intermediate terms $L_t$,

$$L_t(z_0, z_T) = \int dz_{t-1}\, dz_t\, p(z_{t-1}, z_t \mid z_0, z_T) \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (24)$$
$$\overset{\mathrm{BR}}{=} \int dz_{t-1}\, dz_t\, p(z_{t-1} \mid z_0, z_T)\, p(z_t \mid z_{t-1}, z_0, z_T) \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (25)$$
$$\overset{\mathrm{MP}}{=} \int dz_{t-1}\, dz_t\, p(z_{t-1} \mid z_0, z_T)\, p(z_t \mid z_{t-1}, z_T) \log \frac{q_\theta(z_t \mid z_{t-1})}{p(z_t \mid z_{t-1}, z_T)} \quad (26)$$
$$= -\int dz_{t-1}\, p(z_{t-1} \mid z_0, z_T)\, D_{\mathrm{KL}}\big(p(z_t \mid z_{t-1}, z_T) \,\|\, q_\theta(z_t \mid z_{t-1})\big). \quad (27)$$

We combine $L_1$ and $L_t$ to obtain the final expression for the variational lower bound of the log-likelihood:

$$\log q_\theta(y \mid x) \geq -\sum_{t=1}^{T} \mathbb{E}_{z_{t-1} \sim p(z_{t-1} \mid x, y)}\, D_{\mathrm{KL}}\big(p(z_t \mid z_{t-1}, y) \,\|\, q_\theta(z_t \mid z_{t-1})\big) \quad (28)$$
$$= -\sum_{t=0}^{T-1} \mathbb{E}_{z_t \sim p(z_t \mid x, y)}\, D_{\mathrm{KL}}\big(p(z_{t+1} \mid z_t, y) \,\|\, q_\theta(z_{t+1} \mid z_t)\big). \quad (29)$$

Finally, we obtain the form provided in (7) by replacing the sum over all $T$ terms with its unbiased estimator,

$$\log q_\theta(y \mid x) \geq -T\, \mathbb{E}_{t \sim U(0, \ldots, T-1)}\, \mathbb{E}_{z_t \sim p(z_t \mid x, y)}\, D_{\mathrm{KL}}\big(p(z_{t+1} \mid z_t, y) \,\|\, q_\theta(z_{t+1} \mid z_t)\big). \quad (30)$$

A.2 CUMULATIVE TRANSITION MATRIX $\bar{Q}_t$

A proof by induction. For $t = 0$, by definition, $\bar{Q}_0 = Q_0$ and $\bar{\alpha}_0 = \alpha_0$. Assume that for $t > 0$ Equation 8 holds, i.e. $\bar{Q}_t = \bar{\alpha}_t I_K + (1 - \bar{\alpha}_t)\, y \mathbf{1}_K^\top$ with $\bar{\alpha}_t = \prod_{s=0}^{t} \alpha_s$. Then for $t + 1$ we have

$$\bar{Q}_{t+1} = Q_{t+1} \bar{Q}_t \quad (31)$$
$$= \big(\alpha_{t+1} I_K + (1 - \alpha_{t+1})\, y \mathbf{1}_K^\top\big)\big(\bar{\alpha}_t I_K + (1 - \bar{\alpha}_t)\, y \mathbf{1}_K^\top\big) \quad (32)$$
$$= \alpha_{t+1} \bar{\alpha}_t I_K + \big(\alpha_{t+1}(1 - \bar{\alpha}_t) + (1 - \alpha_{t+1})\, \bar{\alpha}_t\big)\, y \mathbf{1}_K^\top \quad (33)$$
$$\quad + (1 - \alpha_{t+1})(1 - \bar{\alpha}_t)\, y \mathbf{1}_K^\top y \mathbf{1}_K^\top. \quad (34)$$

Note that $y \mathbf{1}_K^\top y \mathbf{1}_K^\top = y \mathbf{1}_K^\top$ and $\alpha_{t+1} \bar{\alpha}_t = \bar{\alpha}_{t+1}$. Therefore, we get

$$\bar{Q}_{t+1} = \bar{\alpha}_{t+1} I_K + \big(\alpha_{t+1} - \bar{\alpha}_{t+1} + \bar{\alpha}_t - \bar{\alpha}_{t+1} + 1 - \alpha_{t+1} - \bar{\alpha}_t + \bar{\alpha}_{t+1}\big)\, y \mathbf{1}_K^\top \quad (35)$$
$$= \bar{\alpha}_{t+1} I_K + (1 - \bar{\alpha}_{t+1})\, y \mathbf{1}_K^\top. \quad (36)$$

A.3 IMPLEMENTATION DETAILS

A.3.1 NOISE SCHEDULE

In all experiments, we use the cosine schedule (Nichol & Dhariwal, 2021)

$$\bar{\alpha}_t = \cos\!\left(0.5\pi\, \frac{t/T + s}{1 + s}\right)^2$$

with $s = 0.008$ and number of time steps $T = 500$.

A.3.2 ADDITIONAL FEATURES

We represent molecules as fully-connected graphs where node features are one-hot encoded atom types (sixteen atom types and an additional dummy type) and edge features are covalent bond types (three bond types and an additional 'none' type). Besides, we use a global graph feature $y$ which includes the normalized time step: $y = t/T$. Similarly to Vignac et al. (2022), we compute additional node features. For completeness, we provide the description of these features, as in Vignac et al. (2022), below.
**Cycles** Rings of different sizes are crucial features of many bioactive molecules, but graph neural networks are unable to detect them (Chen et al., 2020). We therefore add both global cycle counts $y_k$, which capture the overall number of cycles in the graph, and local cycle counts $H_k$, which measure how many cycles each node belongs to. These quantities are computed for $k$-cycles up to size $k = 5$ and $k = 6$ for local and global counts, respectively. As proposed by Vignac et al. (2022)², we use the following equations that can be efficiently computed on GPUs:

$$x_4 = \tfrac{1}{2}\big(\operatorname{diag}(A^4) - d \odot (d - \mathbf{1}_n) - A d\big),$$
$$x_5 = \tfrac{1}{2}\big(\operatorname{diag}(A^5) - 2\,T d - 2\,d \odot \operatorname{diag}(A^3) - A \operatorname{diag}(A^3) + 5\operatorname{diag}(A^3)\big),$$
$$y_6 = \operatorname{Tr}(A^6) - 3\operatorname{Tr}(A^3 \odot A^3) + 9\,\big\|A (A^2 \odot A^2)\big\|_F - 6\operatorname{diag}(A^2)^\top \operatorname{diag}(A^4) + 6\operatorname{Tr}(A^4) - 4\operatorname{Tr}(A^3) + 4\operatorname{Tr}\big(A^2 (A^2 \odot A^2)\big) + 3\,\big\|A^3\big\|_F - 12\operatorname{Tr}(A^2 \odot A^2) + 4\operatorname{Tr}(A^2), \quad (37)$$

where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix and $d \in \mathbb{R}^n$ the node degree vector. The matrix $T = A \odot A^2$ indicates how many triangles each node shares with every other node.

²We use a corrected equation for $H_5$.

**Spectral features** Again following Vignac et al. (2022), we include spectral features based on the eigenvalues and eigenvectors of the graph Laplacian. We use the multiplicity of eigenvalue 0 and the first five nonzero eigenvalues as graph-level features, and an indicator of the largest connected component (approximated from the eigenvectors corresponding to zero eigenvalues) as well as the two eigenvectors corresponding to the first nonzero eigenvalues as node-level features.

**Molecular features** We also tried adding the molecular weight as a graph-level feature and each atom's valency as a node-level feature, following Vignac et al. (2022). Models trained with these features were used everywhere except for the experiments reported in Tables 1 and 2. Later on, we found that removing these features improves the performance on the validation set. Therefore, the final model from Tables 1 and 2 does not use molecular features.

A.3.3 NEURAL NETWORK ARCHITECTURE

We use a graph transformer network (Dwivedi & Bresson, 2020; Vignac et al., 2022) to approximate the final state of the Markov bridge process. The architecture of the network is provided in Figure 4A. First, node, edge and global features are passed through an encoder, which is implemented as an MLP. Next, the encoded features are processed by a sequence of Graph Transformer Layers (GTL). As shown in Figure 4B, a GTL first updates node features using self-attention and combines its output with edge features via FiLM (Perez et al., 2018):

$$\mathrm{FiLM}(X_1, X_2) = X_1 W_1 + (X_1 W_2) \odot X_2 + X_2, \quad (38)$$

where $W_1$ and $W_2$ are learnable parameters. Then, edge features are updated using attention scores and global features. To update the global features, the GTL combines encoded global features with node and edge features aggregated with PNA:

$$\mathrm{PNA}(X) = \mathrm{concat}\big(\max(X), \min(X), \mathrm{mean}(X), \mathrm{std}(X)\big)\, W, \quad (39)$$

where $W$ is a learnable parameter. Finally, to obtain the final graph representation, the updated node and edge features are passed through a decoder, which is implemented as an MLP.

A.3.4 TRAINING

We train our models on a single Tesla V100-PCIE-32GB GPU using the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rate 0.0002 and batch size 64. We trained models for up to 800 epochs (which takes 72 hours) and then selected the best checkpoints based on top-5 accuracy, computed on a subset of the USPTO-50k validation set.
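Tying the cumulative matrices of Appendix A.2 to the schedule of Appendix A.3.1, here is a minimal sketch of computing the cumulative $\bar{\alpha}_t$ values and recovering per-step $\alpha_t$ from them; the exact normalization of the cosine schedule (the division by $f(0)$ and by $1 + s$) follows our reading of Nichol & Dhariwal (2021) and should be treated as an assumption rather than a transcription of the released implementation:

```python
import numpy as np

def cosine_alpha_bar(T: int = 500, s: float = 0.008) -> np.ndarray:
    """Cumulative schedule alpha_bar_t for t = 0..T, normalized so that
    alpha_bar_0 = 1 and alpha_bar_T ~ 0 (which pins the bridge, eq. 8)."""
    t = np.arange(T + 1)
    f = np.cos(0.5 * np.pi * (t / T + s) / (1 + s)) ** 2
    return f / f[0]

def per_step_alphas(alpha_bar: np.ndarray) -> np.ndarray:
    """Invert alpha_bar_t = prod_{s<=t} alpha_s to recover each alpha_t."""
    return alpha_bar[1:] / alpha_bar[:-1]

alpha_bar = cosine_alpha_bar()
alphas = per_step_alphas(alpha_bar)
print(alpha_bar[0], alpha_bar[-1])  # 1.0 and ~0.0: Q_bar_T ~ y 1^T
```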
Figure 4: Architecture of the network that approximates the final state of the Markov bridge process (A) and scheme of the Graph Transformer Layer (B).

A.3.5 SAMPLING

To establish the dependency between model performance and the number of sampling steps $T$, we sampled reactants for the validation set (50 per input) using different numbers of steps, $T = 500/250/100/50$ (with the same model trained for $T = 500$). As shown in Figure 5A, the results do not change significantly for up to a 2× speedup. The performance starts to degrade dramatically only after a 5-fold reduction of the number of time steps.

We additionally analysed the dependency of the performance on the number of samples. The statistical nature of the scoring method suggests that the performance should strongly correlate with the number of samples. However, as shown in Figure 5B, even with a 5-fold reduction of the number of samples RetroBridge achieves competitive performance.

Despite the huge success and wide applicability of diffusion models and similar score-based generative methods, a known bottleneck of such algorithms is sampling. We also cannot avoid iterative sampling, and hence the inference procedure requires hundreds of forward passes. Our model reported in Table 1 takes about 1.3 seconds to sample a set of reactants. For comparison, it takes 0.0064 and 0.0065 seconds for MEGAN and GLN, respectively. We note, however, that 1 second for sampling a set of reactants for a given product is a completely feasible time for applications of models like ours. In particular, this speed does not prevent the use of our method as a component of a multi-step retrosynthesis planning pipeline.

Figure 5: (A) Dependency of the RetroBridge performance (top-1/3/5/10 accuracy) on the number of sampling steps T. Experiment performed with the RetroBridge model with CE loss on the validation set with 50 samples per product. (B) Dependency of the RetroBridge performance (top-1/3/5/10 accuracy) on the number of samples N. Experiment performed with the RetroBridge model with VLB loss on the test set with 500 sampling steps.

Table 4: Top-k accuracy (exact match) on the USPTO-50k test dataset and classification of methods. TB: template-based; TF: template-free; G: graph-based; S: SMILES-based; FP: molecular fingerprints; PT: pre-trained on larger datasets. The first group of rows contains methods that had access to significantly more training data (pre-trained); the second contains methods that operate on a limited vocabulary (template-based). The best performing methods in each category are highlighted in bold.

| Model | k = 1 | k = 3 | k = 5 | k = 10 | Templ. | Data | PT |
|---|---|---|---|---|---|---|---|
| RSMILES (Zhong et al., 2022) | 56.3 | **79.2** | **86.2** | **91.0** | TF | S | ✓ |
| PMSR (Jiang et al., 2023) | **62.0** | 78.4 | 82.9 | 86.8 | TF | S | ✓ |
| Retrosym (Coley et al., 2017) | 37.3 | 54.7 | 63.3 | 74.1 | TB | FP | |
| GLN (Dai et al., 2019) | 52.5 | 74.7 | 81.2 | 87.9 | TB | G | |
| GraphRetro (Somnath et al., 2021) | **53.7** | 68.3 | 72.2 | 75.5 | TB | G | |
| LocalRetro (Chen & Jung, 2021) | 52.6 | **76.0** | **84.4** | **90.6** | TB | G | |
| SCROP (Zheng et al., 2019) | 43.7 | 60.0 | 65.2 | 68.7 | TF | S | |
| G2G (Shi et al., 2020) | 48.9 | 67.6 | 72.5 | 75.5 | TF | G | |
| Aug. Transformer (Tetko et al., 2020) | 48.3 | – | 73.4 | 77.4 | TF | S | |
| Dual-TF-aug (Sun et al., 2021) | **53.6** | 70.7 | 74.6 | 77.0 | TF | S | |
| MEGAN (Sacha et al., 2021) | 48.0 | 70.9 | 78.1 | 85.4 | TF | G | |
| Tied Transformer (Kim et al., 2021) | 47.1 | 67.1 | 73.1 | 76.3 | TF | S | |
| GTA-aug (Seo et al., 2021) | 51.1 | 67.6 | 74.8 | 81.6 | TF | G/S | |
| Graph2SMILES (Tu & Coley, 2022) | 52.9 | 66.5 | 70.0 | 72.9 | TF | G/S | |
| Retroformer-aug (Wan et al., 2022) | 52.9 | 68.2 | 72.5 | 76.4 | TF | G/S | |
| RetroBridge (ours) | 50.8 | **74.1** | **80.6** | **85.6** | TF | G | |

A.4 EXTENDED RESULTS

In Table 4, we provide an extended version of the exact-match results from Table 1, including additional models and methodological details.

A.5 FORWARD REACTION PREDICTION

We additionally trained and evaluated two models for forward reaction prediction using the USPTO-50k and USPTO-MIT datasets. We used the same hyperparameters as in the other experiments. As shown in Table 5, our models (ForwardBridge) demonstrate performance comparable with other state-of-the-art methods. However, we stress that the probabilistic formulation is less applicable to the forward reaction prediction task, and under certain assumptions this problem can be considered completely deterministic. Therefore, we leave the study of the capabilities of the Markov Bridge Model in the context of forward reaction prediction out of the scope of this work.

Table 5: Top-k accuracy for forward reaction prediction.

| Dataset | Model | k = 1 | k = 3 | k = 5 |
|---|---|---|---|---|
| USPTO-50k | ForwardBridge (ours) | 89.9 | 93.9 | 94.0 |
| USPTO-MIT | ForwardBridge (ours) | 81.6 | 88.5 | 89.8 |
| USPTO-MIT | MEGAN | 86.3 | 92.4 | 94.0 |
| USPTO-MIT | Mol. Transformer | 88.7 | 93.1 | 94.2 |

Figure 6: Examples of modeled reactants. We selected 10 random inputs from the USPTO-50k test set and for each of them we provide the top-3 RetroBridge predictions along with their confidence scores. Two check marks indicate that the sampled reactants are the same as the ground truth, and one check mark means that the reactants are different, but Molecular Transformer (Schwaller et al., 2019) predicts the product molecule used as input.