PRE-TRAINING AND FINE-TUNING GENERATIVE FLOW NETWORKS

Published as a conference paper at ICLR 2024

Ling Pan 1,2, Moksh Jain 2,3, Kanika Madan 2,3, Yoshua Bengio 2,3,4
1 Hong Kong University of Science and Technology, 2 Mila - Québec AI Institute, 3 Université de Montréal, 4 CIFAR AI Chair
penny.ling.pan@gmail.com

ABSTRACT

Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcome, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model allows for the direct extraction of a policy capable of sampling from any new reward function in downstream tasks. Nonetheless, adapting OC-GFN to a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor that enables efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.

1 INTRODUCTION

Unsupervised learning on large stores of data from the internet has resulted in significant advances in a variety of domains (Howard & Ruder, 2018; Devlin et al., 2018; Radford et al., 2019; Henaff, 2020). Pre-training with unsupervised objectives, such as next-token prediction in auto-regressive language models (Radford et al., 2019), on large-scale unlabelled data enables the development of models that can be effectively fine-tuned for novel tasks using few samples (Brown et al., 2020). Unsupervised learning at scale allows models to learn good representations, which enables data-efficient adaptation to novel tasks, and is central to the recent trend towards larger models. On the other hand, in the context of amortized inference, Generative Flow Networks (GFlowNets; Bengio et al., 2021) enable learning generative models for sampling from high-dimensional distributions over discrete compositional objects. Inspired by reinforcement learning (RL), GFlowNets learn a stochastic policy that sequentially generates compositional objects with probability proportional to a given reward, instead of maximizing the reward.
Therefore, GFlowNets have found success in applications to scientific discovery problems to generate high-quality and diverse candidates (Bengio et al., 2021; Jain et al., 2023a;b), as well as alternatives to Markov chain Monte Carlo and variational inference for modeling Bayesian posteriors (Deleu et al., 2022; van Krieken et al., 2022). As a motivating example, consider the drug discovery pipeline. A GFlowNet can be trained to generate candidate RNA sequences that bind to a target, using as reward the binding affinity of the RNA sequence with the target (Lorenz et al., 2011; Sinai et al., 2020), which can be uncertain and imperfect given the current understanding of the biological system and the available experimental data. However, there is no way to efficiently adapt the GFlowNet to sample RNA sequences binding to a different target of interest that reflects new properties. Unlike human intelligence, GFlowNets currently lack the ability to leverage previously learned knowledge to efficiently adapt to new tasks with unseen reward functions, and need to be trained from scratch to learn a policy matching the given extrinsic reward function for each task. Inspired by the success of the unsupervised pre-training and fine-tuning paradigm in the vision, language, and RL domains (Jaderberg et al., 2016; Sekar et al., 2020), it is natural to ask how this paradigm can benefit GFlowNets. As a step in this direction, we propose a fundamental approach to realize this paradigm for GFlowNets.

In this paper, we propose a novel method for reward-free unsupervised pre-training of GFlowNets. We formulate the problem of pre-training GFlowNets as the self-supervised problem of learning an outcome-conditioned GFlowNet (OC-GFN) that learns to reach any outcome (goal), as a functional understanding of the environment (akin to goal-conditioned RL (Chebotar et al., 2021)). The reward for training this OC-GFN is defined as the success of reaching the outcome. The inherently sparse nature of this task-agnostic reward introduces critical challenges for efficiently training OC-GFN in complex domains, particularly with higher-dimensional outcomes and long-horizon problems, since it is difficult to reach the outcome and obtain a meaningful reward signal. To tackle these challenges, we introduce a novel contrastive learning procedure to train OC-GFN to effectively handle such sparse rewards, which induces an implicit curriculum for efficient learning that resembles goal relabeling (Andrychowicz et al., 2017). To enable efficient learning in long-horizon tasks, we further propose a goal teleportation scheme to effectively propagate the learning signal to each step.

A remarkable result is that one can directly convert this pre-trained OC-GFN to sample proportionally to a new reward function for downstream tasks (Bengio et al., 2023). It is worth noting that, in principle, this can be achieved even without re-training the policy, which is usually required for fine-tuning in RL, since RL only learns a reward-maximizing policy that may discard much useful information. Adapting the pre-trained OC-GFN model to a new reward function, however, involves an intractable marginalization over possible outcomes. We propose a novel alternative by learning a predictor that amortizes this marginalization, allowing efficient fine-tuning of the OC-GFN to downstream tasks.
Our key contributions can be summarized as follows:

- We propose reward-free pre-training of GFlowNets by training an outcome-conditioned GFlowNet (OC-GFN) that learns to sample trajectories reaching any outcome.
- We investigate how to leverage the pre-trained OC-GFN model to adapt to downstream tasks with new rewards, and introduce an efficient method that learns an amortized predictor to approximate an intractable marginal required for fine-tuning the pre-trained OC-GFN model.
- Through extensive experiments on the GridWorld domain, we empirically validate the efficacy of the proposed pre-training and fine-tuning paradigm. We also demonstrate its scalability to larger-scale and challenging biological sequence design tasks, where it achieves consistent and substantial improvements over strong baselines, especially in terms of the diversity of generated samples.

2 PRELIMINARIES

Given a space of compositional objects X and a non-negative reward function R : X → R⁺, the GFlowNet policy π is trained to sample objects x ∈ X from the distribution defined by R(x), i.e., π(x) ∝ R(x). Compositional objects are sampled sequentially, with each step adding a building block a ∈ A (action space) to the current partially constructed object s ∈ S (state space). We can define a directed acyclic graph (DAG) G = (S, A) with the partially constructed objects forming the nodes of G, including a special empty state s_0. The edges are s → s', where s' is obtained by applying an action a ∈ A to s. The complete objects X are the terminal (childless) nodes of the DAG. The generation of an object x ∈ X corresponds to a complete trajectory in the DAG starting from s_0 and terminating in a terminal state s_n = x, i.e., τ = (s_0 → ... → x). We assign a non-negative weight, called the state flow F(s), to each state s ∈ S. The forward policy P_F(s'|s) is a collection of distributions over the children of each state, and the backward policy P_B(s|s') is a collection of distributions over the parents of each state. The forward policy induces a distribution over trajectories P_F(τ), and the marginal likelihood of sampling a terminal state x is obtained by marginalizing over trajectories terminating in x, P_F(x) = Σ_{τ=(s_0 → ... → x)} P_F(τ). GFlowNets solve the problem of learning a parameterized policy P_F(·|·; θ) such that P_F(x) ∝ R(x).

Training GFlowNets. We use the detailed balance learning objective (DB; Bengio et al., 2023) to learn the parameterized policies and flows based on Eq. (1), which imposes the flow consistency constraint at the edge level (i.e., the incoming flow for edge s → s' matches the outgoing flow). When trained to completion, the objective yields the desired policy.

∀ (s → s') ∈ A:   F(s) P_F(s'|s) = F(s') P_B(s|s').   (1)

As in reinforcement learning, exploration is a key challenge in GFlowNets. Generative Augmented Flow Networks (GAFlowNets; Pan et al., 2023b) incorporate intrinsic intermediate rewards, represented as augmented flows in the flow network, to drive exploration, where r_i(s → s') is specified by intrinsic motivation (Burda et al., 2018), yielding the following variant of detailed balance:

∀ (s → s') ∈ A:   F(s) P_F(s'|s) = F(s') P_B(s|s') + r_i(s → s').   (2)
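As a concrete illustration of the two objectives above, the following minimal PyTorch-style sketch (our own illustration, not the authors' code; the function names and log-space inputs are assumptions) computes the per-edge squared log-difference losses corresponding to Eq. (1) and Eq. (2).

```python
import torch

def db_loss(log_F_s, log_PF, log_F_next, log_PB):
    # Detailed balance (Eq. 1) trained in the log domain:
    # (log F(s) + log P_F(s'|s) - log F(s') - log P_B(s|s'))^2 for one edge s -> s'.
    return (log_F_s + log_PF - log_F_next - log_PB) ** 2

def gafn_db_loss(log_F_s, log_PF, log_F_next, log_PB, r_intrinsic, eps=1e-8):
    # GAFlowNet variant (Eq. 2): the backward side is augmented with the
    # intrinsic reward r_i(s -> s') before taking the logarithm.
    rhs = torch.exp(log_F_next + log_PB) + r_intrinsic
    return (log_F_s + log_PF - torch.log(rhs + eps)) ** 2
```

In practice these per-edge terms would be averaged over the edges of sampled trajectories and minimized with respect to the flow and policy parameters.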
3 RELATED WORK

GFlowNets. While several learning objectives have been proposed for improving credit assignment and sample efficiency in GFlowNets (Bengio et al., 2021; 2023), such as detailed balance (Bengio et al., 2023), sub-trajectory balance (Madan et al., 2022), and forward-looking objectives (Pan et al., 2023a), they need to be trained from scratch with a given reward function, which may limit their applicability to more practical problems. Owing to their flexibility, GFlowNets have been applied to a wide range of problems where diverse high-quality candidates are needed, such as molecule generation (Bengio et al., 2021), biological sequence design (Jain et al., 2022), combinatorial optimization (Zhang et al., 2023a;b), and multi-objective optimization (Jain et al., 2023b). There have also been recent efforts in generalizing GFlowNets to stochastic environments, including Stochastic GFlowNets (Pan et al., 2023c) and Distributional GFlowNets (Zhang et al., 2023c), which are effective in handling stochasticity in transition dynamics and rewards, and in improving the training efficiency of GFlowNets with ideas from reinforcement learning (Pan et al., 2020; Lau et al., 2024) and evolutionary algorithms (Ikram et al., 2024).

Unsupervised Pre-Training in Reinforcement Learning. Following the progress in language modeling and computer vision, there has been growing interest in pre-training reinforcement learning (RL) agents in an unsupervised stage, without access to task-specific rewards, to learn representations. Agents typically learn a set of different skills (Eysenbach et al., 2018; Hansen et al., 2019; Zhao et al., 2021; Liu & Abbeel, 2021), and then fine-tune the learned policy on downstream tasks. Contrary to reward maximization, GFlowNets learn a stochastic policy to match the reward distribution. As we show in the next section, such a learned policy can be adapted to a new reward function even without re-training (Bengio et al., 2023).

Goal-Conditioned Reinforcement Learning. Different from standard reinforcement learning (RL) methods that learn policies or value functions based only on observations, goal-conditioned RL (Kaelbling, 1993) also takes goals into consideration by augmenting the observation with the goal as an additional input, and has been studied in a number of prior works (Schaul et al., 2015; Nair et al., 2018; Veeriah et al., 2018; Eysenbach et al., 2020). Goal-conditioned RL is trained to greedily achieve different goals specified explicitly as input, making it possible for agents to generalize their learned abilities across different environments.

4 PRE-TRAINING AND FINE-TUNING GENERATIVE FLOW NETWORKS

Figure 1: The unsupervised pre-training phase of the outcome-conditioned GFlowNet.

In the original formulation, a GFlowNet needs to be trained from scratch whenever it faces a previously unseen reward function (with consistent state and action spaces, as motivated in Section 1). In this section, we aim to leverage the power of pre-training in GFlowNets for efficient adaptation to downstream tasks. To tackle this important challenge, we propose a novel approach that frames the problem of pre-training GFlowNets as a self-supervised problem, by training an outcome-conditioned GFlowNet that learns to reach any input terminal state (outcome) without task-specific rewards. We then propose to leverage the pre-trained GFlowNet model for efficient fine-tuning on downstream tasks with a new reward function.
4.1 UNSUPERVISED PRE-TRAINING STAGE

As GFlowNets are typically learned with a given reward function, it remains an open challenge how to pre-train them in a reward-free fashion. We propose a novel method for unsupervised pre-training of GFlowNets without task-specific rewards. We formulate the problem of pre-training GFlowNets as the self-supervised problem of learning an outcome-conditioned GFlowNet (OC-GFN) that learns to reach any input target outcome, inspired by the success of goal-conditioned reinforcement learning in generalizing to a variety of tasks when tasked with different goals (Fang et al., 2022). We denote the outcome as y = f(s), where f is the identity function and s is a terminal state, so the space of outcomes is the same as the space of terminal states X. What makes OC-GFN special is that, when fully trained and given a reward R specified a posteriori as a function r of the outcome, i.e., R(s) = r(y), one can adapt the OC-GFN to sample from this reward, which allows it to generalize to tasks with different rewards.

Algorithm 1 Reward-free Pre-training of Unsupervised GFlowNets
Require: GAFlowNet F^U(s), P_F^U(s'|s), P_B^U(s|s'); fixed target network φ̄; predictor network φ̂; outcome-conditioned GFlowNet (OC-GFN) F^C(s|y), P_F^C(s'|s, y), P_B^C(s|s', y)
1: for each training step t = 1 to T do
2:   Collect a trajectory τ⁺ = {s_0⁺ → ... → s_n⁺} with P_F^U
3:   // Update the outcome-conditioned GFlowNet model
4:   y⁺ ← f(s_n⁺), R(s_n⁺|y⁺) ← 1{f(s_n⁺) = y⁺} ≡ 1
5:   Update OC-GFN towards minimizing Eq. (5) with τ⁺ and R(s_n⁺|y⁺)
6:   Collect a trajectory τ⁻ = {s_0⁻ → ... → s_n⁻} with P_F^C conditioned on y⁺
7:   y⁻ ← f(s_n⁻), R(s_n⁻|y⁺) ← 1{f(s_n⁻) = y⁺}
8:   Update OC-GFN towards minimizing Eq. (5) with τ⁻ and R(s_n⁻|y⁺)
9:   // Update the unconditional GAFlowNet model
10:  Update the GAFlowNet towards optimizing Eq. (2) based on τ⁺ with r_i ← ||φ̂(s) − φ̄(s)||²
11:  Update the predictor network φ̂ towards minimizing ||φ̂(s) − φ̄(s)||²
12: end for

4.1.1 OUTCOME-CONDITIONED GFLOWNETS (OC-GFN)

We extend the flow functions and policies of GFlowNets to an OC-GFN that generalizes across different outcomes y in the outcome space and is trained to achieve any specified y. OC-GFN is realized by providing an additional input y to the flows and policies, resulting in outcome-conditioned forward and backward policies P_F(s'|s, y) and P_B(s|s', y), and flows F(s|y). The resulting learning objective for OC-GFN for intermediate states is shown in Eq. (3).

F(s|y) P_F(s'|s, y) = F(s'|y) P_B(s|s', y).   (3)

Outcome generation. We can train OC-GFN by conditioning the outcome-conditioned flows and policies on a specified outcome y, and we now study how to generate such outcomes autotelically. It is worth noting that we need to train it with full support over y. We propose to leverage GAFN (Pan et al., 2023b) with augmented flows, which enables efficient reward-free exploration driven purely by intrinsic motivation (Burda et al., 2018). In practice, we generate diverse outcomes y with GAFN and provide them to OC-GFN to sample an outcome-conditioned trajectory τ = (s_0, ..., s_n). The effect of the GAFN is studied in Appendix C.2; it is critical in large and high-dimensional problems, as it affects the efficiency of generating diverse outcomes. The resulting conditional reward is given by R(s_n|y) = 1{f(s_n) = y}. Thus, OC-GFN receives a zero reward if it fails to reach the target outcome, and a positive reward otherwise, which results in learning an outcome-achieving policy.
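To make the data-collection side of Algorithm 1 concrete, the sketch below illustrates the binary conditional reward R(s_n|y) = 1{f(s_n) = y} and an RND-style intrinsic reward in the spirit of lines 10-11; the class and function names, network sizes, and the flat tensor encoding of states are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class RNDIntrinsicReward(nn.Module):
    """Random network distillation (Burda et al., 2018): a fixed random target
    network and a trained predictor; the prediction error serves as the
    intrinsic reward r_i used by the unconditional GAFlowNet."""
    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, feat_dim), nn.ReLU(),
                                       nn.Linear(feat_dim, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False  # the target network stays fixed

    def forward(self, s):
        # Prediction error of the predictor against the frozen target.
        return ((self.predictor(s) - self.target(s).detach()) ** 2).mean(-1)

def conditional_reward(terminal_state, outcome):
    """R(s_n | y) = 1{f(s_n) = y}, with f the identity on terminal states."""
    return float(torch.equal(terminal_state, outcome))
```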
Contrastive training. However, it can be challenging to efficiently train OC-GFN in problems with large outcome spaces. This is because it can be hard to actually achieve the outcome and obtain a meaningful learning signal, owing to the sparse nature of the rewards: most of the conditional rewards R(s_n|y) will be zero, since f(s_n) ≠ y when OC-GFN fails to reach the target. To alleviate this, we propose a contrastive learning objective for training the OC-GFN. After sampling a trajectory τ⁺ = {s_0⁺ → ... → s_n⁺} from the unconditional GAFN, we first train OC-GFN on this off-policy trajectory, treating it as having achieved y⁺ = s_n⁺ when conditioned on y⁺. Note that the resulting conditional reward is R(s_n⁺|y⁺) = 1{f(s_n⁺) = y⁺} ≡ 1 in this case, so all such trajectories correspond to successful experiences that provide meaningful learning signals to OC-GFN. We then sample another on-policy trajectory τ⁻ = {s_0⁻ → ... → s_n⁻} from OC-GFN by conditioning it on y⁺, and evaluate the conditional reward R(s_n⁻|y⁺) = 1{f(s_n⁻) = y⁺}. Although most of the rewards R(s_n⁻|y⁺) may be zero in large outcome spaces during early learning, the off-policy trajectories provide sufficient successful experiences in the initial phase, which is also related to goal relabeling (Andrychowicz et al., 2017). This can be viewed as an implicit curriculum for improving the training of OC-GFN.

Outcome teleportation. The contrastive training paradigm can significantly improve learning efficiency by providing a pool of successful trajectories with meaningful learning signals, which tackles the particular challenge of sparse rewards during learning. However, the agent may still suffer from poor learning efficiency in long-horizon tasks, as it cannot effectively propagate the success/failure signal back to each step. We propose a novel technique, outcome teleportation, for further improving the learning efficiency of OC-GFN, as in Eq. (4), which applies the terminal reward R(x|y) to every transition (noting the binary nature of the rewards R).

F(s|y) P_F(s'|s, y) = F(s'|y) P_B(s|s', y) R(x|y).   (4)

This formulation efficiently propagates the guidance signal to every transition in the trajectory, which can significantly improve learning efficiency, particularly in high-dimensional outcome spaces, as investigated in Section 5.2. It can be interpreted as a form of reward decomposition in outcome-conditioned tasks with binary rewards. In practice, we train OC-GFN by minimizing the following loss function L_OC-GFN(τ, y), obtained from Eq. (4) in the log domain:

L_OC-GFN(τ, y) = Σ_{s→s' ∈ τ} ( log F(s|y) + log P_F(s'|s, y) − log F(s'|y) − log P_B(s|s', y) − log R(x|y) )².   (5)

Theoretical justification. We now show that when OC-GFN is trained to completion, it can successfully reach any specified outcome. The proof can be found in Appendix B.1.

Proposition 4.1. If L_OC-GFN(τ, y) = 0 for all trajectories τ and outcomes y, then the outcome-conditioned forward policy P_F(s'|s, y) successfully reaches the target outcome y.
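The sketch below illustrates how the loss in Eq. (5) could be computed for a single trajectory. It assumes the conditional flow and policies are exposed as callables returning log-values, and it clamps log R(x|y) with a small epsilon when the outcome is missed; both are implementation choices on our part, not details from the paper.

```python
import torch

def oc_gfn_loss(log_F, log_PF, log_PB, traj, y, reached, eps=1e-8):
    """Outcome-teleportation loss of Eq. (5), summed over a trajectory.

    log_F(s, y), log_PF(s, s_next, y), log_PB(s, s_next, y) are callables backed
    by the conditional networks; `reached` is the binary terminal reward
    R(x|y) = 1{f(s_n) = y}, broadcast to every transition of the trajectory.
    """
    log_R = torch.log(torch.tensor(float(reached)) + eps)  # log of the binary reward
    loss = 0.0
    for s, s_next in zip(traj[:-1], traj[1:]):
        delta = (log_F(s, y) + log_PF(s, s_next, y)
                 - log_F(s_next, y) - log_PB(s, s_next, y) - log_R)
        loss = loss + delta ** 2
    return loss
```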
The resulting procedure for the reward-free unsupervised pre-training of OC-GFN is summarized in Algorithm 1 and illustrated in Figure 1.

4.2 SUPERVISED FINE-TUNING STAGE

In this section, we study how to leverage the pre-trained OC-GFN model and adapt it to downstream tasks with new reward functions. A remarkable aspect of GFlowNets is their adaptability in generating a task-specific policy. Using the pre-trained OC-GFN model with outcome-conditioned flows F(s|y) and policies P_F(s'|s, y), we can directly obtain a policy that samples according to a new task-specific reward function R(s) = r(y) via Eq. (6), which is based on (Bengio et al., 2023). A detailed analysis can be found in Appendix B.2.

P_F^r(s'|s) = Σ_y r(y) F(s|y) P_F(s'|s, y) / Σ_y r(y) F(s|y).   (6)

Eq. (6) serves as the foundation for converting a pre-trained OC-GFN model to handle downstream tasks with new and even out-of-distribution rewards. Intriguingly, this conversion can be achieved without any re-training of OC-GFN on downstream tasks. This sets it apart from the typical fine-tuning process in reinforcement learning, which usually requires re-training to adapt the policy, since RL methods generally learn reward-maximizing policies that may discard valuable information.

Directly estimating the above summation often necessitates Monte Carlo averaging for each decision. Yet this can be computationally expensive in high-dimensional outcome spaces if the marginalization must be computed at each decision-making step, which leads to slow thinking. To apply the approach in complex scenarios, it is therefore essential to develop strategies that are efficient while maintaining accurate estimates.

Figure 2: Left: converting the outcome-conditioned GFlowNet to downstream tasks without re-training the networks. Right: an efficient amortized predictor in the supervised fine-tuning phase.

Learning an Amortized Predictor. We now propose a novel approach that approximates this marginal by learning an amortized predictor. Concretely, we propose to estimate the intractable sum in the numerator of Eq. (6) with a numerator network N(s'|s) ≈ Σ_y r(y) F(s|y) P_F(s'|s, y). This allows us to efficiently estimate the intractable sum with the help of N(s'|s), which directly predicts N(·|s) for any state s and can benefit from the generalizable structure of outcomes through neural networks. We can also obtain the corresponding policy as P_F^r(s'|s) = N(s'|s) / Σ_{s''} N(s''|s). To learn the numerator network N(·|s), we need a sampling policy for sampling outcomes y given a state s and next state s', which can be achieved by

Q(y|s', s) = r(y) F(s|y) P_F(s'|s, y) / N(s'|s).   (7)

Therefore, we have the following constraint according to Eq. (7):

N(s'|s) Q(y|s', s) = r(y) F(s|y) P_F(s'|s, y),   (8)

based on which we derive the corresponding loss function by minimizing the difference between the two sides in the log domain, i.e.,

L_amortized = ( log N(s'|s) + log Q(y|s', s) − log r(y) − log F(s|y) − log P_F(s'|s, y) )².   (9)

Theoretical justification. In Proposition 4.2, we show that N(s'|s) correctly estimates the summation when it is trained to completion and the distribution of outcomes y has full support.

Proposition 4.2. Suppose that for all (s, s', y), L_amortized(s, s', y) = 0; then the amortized predictor N(s'|s) estimates Σ_y r(y) F(s|y) P_F(s'|s, y). The proof can be found in Appendix B.3.
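The following minimal sketch illustrates the amortized objective in Eq. (9) and how the converted policy P_F^r is read off from the numerator network; the log-space inputs and the use of a softmax over the children's log N values (which equals N(s'|s) / Σ_{s''} N(s''|s)) are our assumptions about one reasonable implementation, not the authors' code.

```python
import torch

def amortized_loss(log_N, log_Q, log_r, log_F, log_PF):
    # Eq. (9) for one tuple (s, s', y):
    # (log N(s'|s) + log Q(y|s',s) - log r(y) - log F(s|y) - log P_F(s'|s,y))^2
    return (log_N + log_Q - log_r - log_F - log_PF) ** 2

def downstream_policy(log_N_children):
    # Converted forward policy P_F^r(.|s) = N(.|s) / sum_{s''} N(s''|s),
    # given the numerator network's log-outputs over all children of s.
    return torch.softmax(log_N_children, dim=-1)
```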
Proposition 4.2 justifies the use of the numerator network as an efficient alternative for estimating the computationally intractable summation in Eq. (6).

Empirical validation. We now investigate the converted/learned sampling policy in the standard GridWorld domain (Bengio et al., 2021), which has a multi-modal reward function. More details about the setup are in Appendix C.1 due to space limitations. We visualize the last 2 × 10⁵ samples from different baselines, including a GFN trained from scratch, OC-GFN with the Monte Carlo-based estimation of Eq. (6), and the amortized marginalization approach of Eq. (9). As shown in Figure 3(b), directly training a GFN with trajectory balance (Malkin et al., 2022) can suffer from mode collapse and fail to discover all modes of the target distribution in Figure 3(a). On the contrary, we can directly obtain a policy from the pre-trained OC-GFN model that samples proportionally to the target rewards, as shown in Figure 3(c), while fine-tuning OC-GFN with the amortized inference also matches the target distribution, as in Figure 3(d), which validates its effectiveness in estimating the marginal.

Figure 3: Distribution of 2 × 10⁵ samples from different baselines. (a) Target distribution. (b) GFN (from scratch). (c) OC-GFN (Monte Carlo-based). (d) OC-GFN (Amortizer-based).

Practical implementation. In practice, we train the amortized predictor N(·|s) and the sampling policy Q(·|s', s) in a GFlowNet-like procedure. We sample (s, s') with N by interacting with the environment; this can also be realized by sampling from the large and diverse dataset D obtained in the pre-training stage. We then sample outcomes y from the sampling policy Q(·|s', s), which can incorporate a tempered version or ε-greedy exploration (Bengio et al., 2021) to obtain rich distributions of y. We then learn the amortized predictor and the sampling policy by minimizing Eq. (9) with each of these sampled experiences. Finally, we derive the policy to be converted to downstream tasks as P_F^r(s'|s) = N(s'|s) / Σ_{s''} N(s''|s). The resulting training algorithm is summarized in Algorithm 2 and in the right part of Figure 2.

Algorithm 2 Supervised Fine-Tuning of Outcome-Conditioned GFlowNets
1: Initialize the numerator network N(s'|s) and the GFlowNet-like predictor network Q(y|s', s)
2: Obtain the pre-trained outcome-conditioned state flow F(s|y) and forward policy P_F(s'|s, y)
3: for each training step t = 1 to T do
4:   Collect a trajectory τ = {s_0 → ... → s_n} with N(s'|s)
5:   Sample outcomes y from a tempered/ε-greedy version of Q(·|s', s)
6:   Update N and Q towards minimizing Eq. (9)
7:   Compute the policy P_F^r(s'|s) = N(s'|s) / Σ_{s''} N(s''|s)
8: end for

5 EXPERIMENTS

In this section, we conduct extensive experiments to better understand the effectiveness of our approach, aiming to answer the following key questions: (i) How do outcome-conditioned GFlowNets (OC-GFN) perform in the reward-free unsupervised pre-training stage? (ii) What is the effect of each key module? (iii) Can OC-GFN transfer to downstream tasks efficiently? (iv) Can it scale to complex and practical scenarios like biological sequence design?

5.1 GRIDWORLD

We first conduct a series of experiments on GridWorld (Bengio et al., 2021) to understand the effectiveness of the proposed approach. In the reward-free unsupervised pre-training phase, we train an (unconditional) GAFN (Pan et al., 2023b) and an outcome-conditioned GFN (OC-GFN) on a map without task-specific rewards, where the GAFN is trained purely from self-supervised intrinsic rewards according to Algorithm 1. We investigate how well OC-GFN learns in the unsupervised pre-training stage; the supervised fine-tuning stage has been discussed in Section 4.2. Each algorithm is run with 3 different seeds, and the mean and standard deviation are reported. A detailed description of the setup and hyperparameters can be found in Appendix C.1.

Outcome distribution.
We first evaluate the quality of the exploratory data collected by GAFN, as it is essential for training OC-GFN with a rich distribution of outcomes. We show the sample distribution from GAFN in Figure 4(a); it has great diversity and coverage, which validates its effectiveness in collecting unlabeled exploratory trajectories that provide diverse target outcomes for OC-GFN to learn from.

Figure 4: Results in GridWorld with different scales of the task. (a) Outcome distribution. (b)-(d) Success rate of OC-GFN and its variants in small, medium, and large maps, respectively.

Outcome reaching performance. We then investigate the key designs in pre-training OC-GFN by analyzing its success rate for achieving target outcomes. We also ablate key components to investigate the effect of contrastive training (CT) and outcome teleportation (OT). The success rates of OC-GFN and its variants are summarized in Figures 4(b)-(d) for maps of different sizes, from small to large. As shown, OC-GFN successfully learns to reach specified outcomes with a success rate of nearly 100% at the end of training in maps of all sizes. Disabling the outcome teleportation component (OC-GFN w/o OT) leads to lower sample efficiency, particularly in larger maps. Further deactivating the contrastive training process (OC-GFN w/o OT & CT) fails to learn well as the size of the outcome space grows, since the agent can hardly collect successful trajectories for meaningful updates. This indicates that both contrastive training and outcome teleportation are important for successfully training OC-GFN in large outcome spaces. It is also worth noting that outcome teleportation can significantly boost learning efficiency in problems with a combinatorial explosion of the outcome space (e.g., sequence generation in the following sections).

Figure 5: Behaviors of OC-GFN.

Outcome-conditioned behaviors. We visualize the learned behaviors of OC-GFN in Figure 5, where the red and green squares denote the starting point and the goal, respectively. We find that OC-GFN can not only reach different outcomes with a high success rate, but also discover diverse trajectories leading to a specified outcome y with the forward policy P_F. OC-GFN is able to generate diverse trajectories, which differs from typical goal-conditioned RL approaches (Schaul et al., 2015) that usually learn only a single solution to the goal state. This ability can be very helpful for generalizing to downstream tasks with similar structure but subtle changes (e.g., obstacles in the maze) (Kumar et al., 2020).

Applicability to other objectives. We now investigate the versatility of our approach by building it upon another recent GFlowNet learning objective, SubTB (Madan et al., 2022). We investigate the outcome-reaching performance of OC-SubTB in the pre-training stage on a medium map in Figure 6(a), while the sample distribution learned by fine-tuning the outcome-conditioned flows
and policies of OC-SubTB with our amortized predictor is shown in Figure 6(b). As shown, OC-SubTB successfully learns to reach target outcomes, and the proposed contrastive training and outcome teleportation methods achieve consistent gains in efficiency. Fine-tuning OC-SubTB also discovers all the modes, as opposed to training a GFN (Malkin et al., 2022) from scratch (Figure 3). A more detailed discussion is in Appendix B.2.

Figure 6: OC-SubTB in GridWorld.

5.2 BIT SEQUENCE GENERATION

We study the bit sequence generation task (Malkin et al., 2022), where the agent generates sequences of length n. We follow the same procedure for pre-training OC-GFN without task-specific rewards as in Section 5.1, with more details in Appendix C.1.

Analysis of the unsupervised pre-training stage. We first analyze how well OC-GFN learns in the unsupervised pre-training stage by investigating its success rate for achieving specified outcomes in bit sequence generation problems of different scales: small, medium, and large. As shown in Figures 7(a)-(c), OC-GFN successfully reaches outcomes, leading to a high success rate as summarized in Figure 7(d). It is also worth noting that it fails to learn when the outcome teleportation module is disabled, due to the particularly large outcome spaces; in such highly challenging scenarios, the contrastive training paradigm alone does not lead to efficient learning.

Figure 7: Success rates in the bit sequence generation task at different scales. (a) Small. (b) Medium. (c) Large. (d) Summary of final success rates: Small 98.37 ± 0.33%, Medium 97.22 ± 1.70%, Large 93.75 ± 2.04%.

Analysis of the supervised fine-tuning stage. Having justified the effectiveness of training OC-GFN in the unsupervised pre-training stage, we study whether fine-tuning the model with the amortized approach enables faster mode discovery when adapting to downstream tasks (as described in Appendix C.1). We compare the proposed approach against strong baselines, including training a GFN from scratch (Bengio et al., 2023), Metropolis-Hastings MCMC (Dai et al., 2020), and Deep Q-Networks (DQN) (Mnih et al., 2015). We evaluate each method in terms of the number of modes discovered during the course of training, as in previous work (Malkin et al., 2022). We summarize the normalized (between 0 and 1 to facilitate comparison across tasks) number of modes averaged over the downstream tasks in Figure 8, while results for each individual downstream task can be found in Appendix C.5 due to space limitations.

Figure 8: Results in bit sequence generation.

We find that OC-GFN significantly outperforms the baselines in mode discovery in both efficiency and performance, while DQN gets stuck and does not discover diverse modes due to its reward-maximizing nature, and MCMC fails to perform well in problems with larger state spaces.

5.3 TF BIND GENERATION

We study a more practical task of generating DNA sequences with high binding activity to targeted transcription factors (Jain et al., 2022).
We consider 30 different downstream tasks studied in Barrera et al. (2016), who conducted biological experiments to determine the binding properties between a range of transcription factors and every conceivable DNA sequence.

Analysis of the unsupervised pre-training stage. We analyze how well OC-GFN learns by evaluating its success rate. Figure 9 shows that OC-GFN successfully achieves the targeted goals with a success rate of nearly 100% in this practical task of generating DNA sequences.

Figure 9: Success rate in TF Bind generation.

Analysis of the supervised fine-tuning stage. We investigate how well OC-GFN transfers to downstream tasks for generating TF Bind 8 sequences. We tune the hyperparameters on the task PAX3 R270C R1, and evaluate the baselines on the other 29 downstream tasks. We investigate the learning efficiency and performance in terms of the number of modes and the top-K scores during the course of training. The results on 2 downstream tasks are shown in Figure 10, with the full results in Appendix C.6 due to space limitations. Figures 10(a)-(b) illustrate the number of discovered modes, while Figures 10(c)-(d) show the top-K scores. We find that the proposed approach discovers more diverse modes faster and achieves higher top-K (K = 100) scores compared to the baselines. We summarize the rank of each baseline across all 30 downstream tasks (Zhang et al., 2021) in Appendix C.6, where OC-GFN ranks highest among the compared methods. We further visualize in Figure 10(e) the t-SNE plots of the TF Bind 8 sequences discovered by transferring OC-GFN with the amortized approach and by the more expensive but competitive training of a GFN from scratch. As shown, training a GFN from scratch covers only limited regions, while OC-GFN has greater coverage.

Figure 10: Results in the TF Bind sequence generation task on different downstream tasks.

5.4 RNA GENERATION

We now study a larger task of generating RNA sequences that bind to a given target, introduced in Lorenz et al. (2011). We follow the same procedure as in Section 5.3, with details in Appendix C.1. We consider four different downstream tasks from ViennaRNA (Lorenz et al., 2011), each using the binding energy with a different target as the reward. The left part of Figure 11 shows that OC-GFN achieves a success rate of almost 100% at the end of the pre-training stage. The right part of Figure 11 summarizes the normalized (between 0 and 1 to facilitate comparison across tasks) performance in terms of the number of modes averaged over the four downstream tasks, where the performance for each individual task is shown in Appendix C.7. We observe that OC-GFN achieves much higher diversity than the baselines, indicating that the pre-training phase enables OC-GFN to explore the state space much more efficiently.

Figure 11: Results in RNA generation. Left: success rate. Right: number of modes (normalized).
5.5 ANTIMICROBIAL PEPTIDE GENERATION

To demonstrate the scalability of our approach to even more challenging and complex scenarios, we also evaluate it on the biological task of generating antimicrobial peptides (AMP) of length 50 (Jain et al., 2022), resulting in an outcome space of 20^50 candidates. Figure 12(a) shows the success rate of OC-GFN for achieving target outcomes: it still reaches a high success rate in this particularly large outcome space with our training paradigm based on contrastive learning and fast goal propagation. Figure 12(b) shows the number of modes discovered, where OC-GFN provides consistent improvements, validating its ability to scale to much larger tasks.

Figure 12: Results in AMP generation. (a) Success rate. (b) Number of modes.

6 CONCLUSION

We propose a novel method for unsupervised pre-training of GFlowNets through an outcome-conditioned GFlowNet, coupled with a new approach to efficiently fine-tune the pre-trained model for downstream tasks. Our work opens the door to pre-training GFlowNets that can later be fine-tuned for downstream tasks. Empirical results on the standard GridWorld domain validate the effectiveness of the proposed approach in successfully achieving targeted outcomes and the efficiency of the amortized predictor. We also conduct extensive experiments on more complex and challenging biological sequence design tasks to demonstrate its practical scalability. Our method greatly improves learning performance compared with strong baselines, including training a GFlowNet from scratch, particularly in tasks that are challenging for GFlowNets to learn.

ACKNOWLEDGMENTS

The authors acknowledge funding from CIFAR, Genentech, Samsung, and IBM.

ETHICS STATEMENT

We do not foresee any immediate negative societal impact of our work. Our work is motivated by the need for ways to accelerate scientific discovery. However, we note that there is a potential risk of dual use of the technology by nefarious actors (Urbina et al., 2022), as with all work around this topic.

REPRODUCIBILITY

All details of our experiments are discussed in Appendix C, with a detailed description of the tasks, network architectures, hyperparameters for baselines, and setup. We implement all baselines and environments based on open-source repositories described in Appendix C. The proofs of the propositions can be found in Appendix B. Limitations are discussed in Appendix D.

REFERENCES

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.

Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450-1454, 2016.

Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (NeurIPS), 2021.

Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio.
GFlowNet foundations. Journal of Machine Learning Research, 24(210):1-55, 2023. URL http://jmlr.org/papers/v24/22-0364.html.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, et al. Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv preprint arXiv:2104.07749, 2021.

Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton, and Dale Schuurmans. Learning discrete energy-based models via auxiliary-variable local exploration. Advances in Neural Information Processing Systems, 33:10443-10455, 2020.

Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. Uncertainty in Artificial Intelligence (UAI), 2022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. International Conference on Learning Representations (ICLR), 2018.

Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. C-learning: Learning to achieve goals via recursive classification. arXiv preprint arXiv:2011.08909, 2020.

Kuan Fang, Patrick Yin, Ashvin Nair, and Sergey Levine. Planning to practice: Efficient online fine-tuning by composing goals in latent space. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4076-4083. IEEE, 2022.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018.

Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, and Volodymyr Mnih. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.

Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182-4192. PMLR, 2020.

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328-339, 2018.

Zarif Ikram, Ling Pan, and Dianbo Liu. Evolution guided generative flow networks. arXiv preprint arXiv:2402.02186, 2024.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2016.

Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F.P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio.
Biological sequence design with GFlowNets. International Conference on Machine Learning (ICML), 2022.

Moksh Jain, Tristan Deleu, Jason Hartford, Cheng-Hao Liu, Alex Hernandez-Garcia, and Yoshua Bengio. GFlowNets for AI-driven scientific discovery. Digital Discovery, 2023a.

Moksh Jain, Sharath Chandra Raparthy, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective GFlowNets. In International Conference on Machine Learning, pp. 14631-14653, 2023b.

Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, volume 2, pp. 1094-8. Citeseer, 1993.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured MaxEnt RL. Advances in Neural Information Processing Systems, 33:8198-8210, 2020.

Elaine Lau, Stephen Zhewen Lu, Ling Pan, Doina Precup, and Emmanuel Bengio. QGFN: Controllable greediness with action values. arXiv preprint arXiv:2402.05234, 2024.

Hao Liu and Pieter Abbeel. APS: Active pretraining with successor features. In International Conference on Machine Learning, pp. 6736-6747. PMLR, 2021.

Ronny Lorenz, Stephan H Bernhart, Christian Höner zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F Stadler, and Ivo L Hofacker. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1):26, 2011.

Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning GFlowNets from partial episodes for improved convergence and stability. arXiv preprint arXiv:2209.12782, 2022.

Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlowNets. Neural Information Processing Systems (NeurIPS), 2022.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. Advances in Neural Information Processing Systems, 31, 2018.

Ling Pan, Qingpeng Cai, and Longbo Huang. Softmax deep double deterministic policy gradients. Advances in Neural Information Processing Systems, 33:11767-11777, 2020.

Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better training of GFlowNets with local credit and incomplete trajectories. arXiv preprint arXiv:2302.01687, 2023a.

Ling Pan, Dinghuai Zhang, Aaron Courville, Longbo Huang, and Yoshua Bengio. Generative augmented flow networks. International Conference on Learning Representations (ICLR), 2023b.

Ling Pan, Dinghuai Zhang, Moksh Jain, Longbo Huang, and Yoshua Bengio. Stochastic generative flow networks. arXiv preprint arXiv:2302.09465, 2023c.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312-1320. PMLR, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pp. 8583-8592. PMLR, 2020.

Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, and Eric Kelsic. AdaLead: A simple and robust adaptive greedy search algorithm for sequence design. arXiv preprint, 2020.

Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189-191, 2022.

Emile van Krieken, Thiviyan Thanapalasingam, Jakub M Tomczak, Frank van Harmelen, and Annette ten Teije. A-NeSI: A scalable approximate method for probabilistic neurosymbolic inference. arXiv preprint arXiv:2212.12393, 2022.

Vivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.

David W Zhang, Corrado Rainone, Markus Peschl, and Roberto Bondesan. Robust scheduling with GFlowNets. arXiv preprint arXiv:2302.05446, 2023a.

Dinghuai Zhang, Jie Fu, Yoshua Bengio, and Aaron Courville. Unifying likelihood-free inference with black-box optimization and beyond. arXiv preprint arXiv:2110.03372, 2021.

Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron Courville, Yoshua Bengio, and Ling Pan. Let the flows tell: Solving graph combinatorial optimization problems with GFlowNets. arXiv preprint arXiv:2305.17010, 2023b.

Dinghuai Zhang, Ling Pan, Ricky TQ Chen, Aaron Courville, and Yoshua Bengio. Distributional GFlowNets with quantile flows. arXiv preprint arXiv:2302.05793, 2023c.

Rui Zhao, Yang Gao, Pieter Abbeel, Volker Tresp, and Wei Xu. Mutual information state intrinsic control. arXiv preprint arXiv:2103.08107, 2021.

A ADDITIONAL DETAILS OF GFLOWNETS

A.1 DETAILS FOR THE TRAINING LOSSES

As introduced in Section 2, the flow consistency constraint for detailed balance (DB) (Bengio et al., 2023) is F(s) P_F(s'|s) = F(s') P_B(s|s') for (s → s') ∈ A. In practice, we train DB in the log domain for stability (Bengio et al., 2021), i.e.,

∀ (s → s') ∈ A:   L_DB(s, s') = ( log [F(s) P_F(s'|s)] − log [F(s') P_B(s|s')] )²,   (10)

and train the terminal flows F(x) to match the corresponding rewards R(x). The learnable objects here are the state flow F(s), the forward policy P_F(s'|s), and the backward policy P_B(s|s'), which are parameterized by neural networks. For GAFlowNets (Pan et al., 2023b), which incorporate intrinsic intermediate rewards r_i(s → s') into the flow network, we also train the model in the log domain according to Eq. (2), i.e.,

∀ (s → s') ∈ A:   L_GAFN(DB)(s, s') = ( log [F(s) P_F(s'|s)] − log [F(s') P_B(s|s') + r_i(s → s')] )².   (11)

The learnable objects are the same as in DB, except that r_i(s → s') is also learned; it is represented by an intrinsic motivation signal from random network distillation (Burda et al., 2018).

B.1 PROOF OF PROPOSITION 4.1

Proposition 4.1. If L_OC-GFN(τ, y) = 0 for all trajectories τ and outcomes y, then the outcome-conditioned forward policy P_F(s'|s, y) can successfully reach any target outcome y.

Proof. As L_OC-GFN(τ, y) = 0 is satisfied for all trajectories τ = {s_0, ..., s_n = x} and outcomes y, we have

F(s_0|y) Π_{t=0}^{n−1} P_F(s_{t+1}|s_t, y) = Π_{t=0}^{n−1} R(x|y) P_B(s_t|s_{t+1}, y).   (12)

Since R(x|y) is either 1 or 0 in outcome-conditioned tasks, we get

F(s_0|y) Π_{t=0}^{n−1} P_F(s_{t+1}|s_t, y) = R(x|y) Π_{t=0}^{n−1} P_B(s_t|s_{t+1}, y).   (13)

Then, the probability of reaching the target outcome y is

P(x = y|y) = Σ_{τ: s_n = y} P(τ|y).   (14)
By definition, we have

P(τ|y) = Π_{t=0}^{n−1} P_F(s_{t+1}|s_t, y).   (15)

Therefore, we get

P(x = y|y) = Σ_{τ: s_n = y} Π_{t=0}^{n−1} P_F(s_{t+1}|s_t, y).   (16)

Combining Eq. (13) with Eq. (16), and by the law of total probability, we obtain

F(s_0|y) P(x = y|y) = Σ_{τ: s_n = y} R(y|y) Π_{t=0}^{n−1} P_B(s_t|s_{t+1}, y) = Σ_{τ: s_n = y} Π_{t=0}^{n−1} P_B(s_t|s_{t+1}, y) = 1,   (17)

where we note that R(y|y) = 1 according to the definition of the (binary) reward function. Applying the same analysis to the case where the agent fails to reach the target outcome y, i.e., x ≠ y and R(x|y) = 0, we have

∀ x ≠ y:   F(s_0|y) P(x|y) = 0.   (18)

Combining Eq. (17) with Eq. (18), we have P(x = y|y) = 1, i.e., the outcome-conditioned forward policy P_F(s'|s, y) can successfully reach any target outcome y.

B.2 ANALYSIS OF THE CONVERSION POLICY

We now elaborate on the effect of Eq. (6), following (Bengio et al., 2023). When the outcome-conditioned GFlowNet (OC-GFN) is trained to completion, the following edge-level flow consistency constraint is satisfied for intermediate states:

F(s|y) P_F(s'|s, y) = F(s'|y) P_B(s|s', y).   (19)

We define the state flow function as F^r(s) = Σ_y r(y) F(s|y), and the backward policy as

P_B^r(s|s') = Σ_y r(y) F(s'|y) P_B(s|s', y) / Σ_y r(y) F(s'|y),   (20)

while the forward policy is defined in Eq. (6), i.e.,

P_F^r(s'|s) = Σ_y r(y) F(s|y) P_F(s'|s, y) / Σ_y r(y) F(s|y).   (21)

Then, we have

F^r(s) P_F^r(s'|s) = Σ_y r(y) F(s|y) P_F(s'|s, y),   (22)

and

F^r(s') P_B^r(s|s') = Σ_y r(y) F(s'|y) P_B(s|s', y).   (23)

Combining the above equations with Eq. (19), we obtain

F^r(s) P_F^r(s'|s) = F^r(s') P_B^r(s|s'),   (24)

which corresponds to a new edge-level flow consistency constraint. A more detailed proof can be found in (Bengio et al., 2023).

Discussion about applicability. It is worth noting that OC-GFN should be built upon the detailed balance objective (Bengio et al., 2023) discussed in Section 2, or other variants that learn flows. The trajectory balance objective (Malkin et al., 2022), in contrast, does not learn a state flow function, which is necessary for converting the pre-trained OC-GFN model to the new policy π^r. Following Section 4.1.1, we obtain the learning objective for OC-GFN built upon sub-trajectory balance (SubTB) (Madan et al., 2022) for a sub-trajectory τ_{i:j} = {s_i, ..., s_j} as in Eq. (25):

F(s_i|y) Π_{t=i}^{j−1} P_F(s_{t+1}|s_t, y) = F(s_j|y) Π_{t=i}^{j−1} P_B(s_t|s_{t+1}, y).   (25)

Further incorporating our proposed outcome teleportation technique for OC-GFN (SubTB) results in the training objective in Eq. (26), which is trained in the log domain:

L_OC-GFN (SubTB)(τ) = Σ_{0 ≤ i < j ≤ n} w_ij ( log F(s_i|y) + Σ_{t=i}^{j−1} log P_F(s_{t+1}|s_t, y) − log F(s_j|y) − Σ_{t=i}^{j−1} log P_B(s_t|s_{t+1}, y) − log R(x|y) )²,   (26)

where w_ij = λ^{j−i} / Σ_{0 ≤ i' < j' ≤ n} λ^{j'−i'} weights each sub-trajectory by its length, as in Madan et al. (2022).
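To complement the derivation above, here is an illustrative sketch of the OC-GFN (SubTB) objective with outcome teleportation in Eq. (26); the callables, the epsilon used for log R(x|y) when the outcome is missed, and the default value of λ are our assumptions rather than details from the paper.

```python
import torch

def oc_subtb_loss(log_F, log_PF, log_PB, traj, y, reached, lam=0.9, eps=1e-8):
    """Lambda-weighted sum of squared log-balance residuals over all
    sub-trajectories of traj, with the binary terminal reward R(x|y)
    applied to every sub-trajectory (outcome teleportation)."""
    n = len(traj) - 1
    log_R = torch.log(torch.tensor(float(reached)) + eps)
    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = lam ** (j - i)  # unnormalized SubTB weight for tau_{i:j}
            residual = log_F(traj[i], y) - log_F(traj[j], y) - log_R
            for t in range(i, j):
                residual = residual + log_PF(traj[t], traj[t + 1], y) \
                                    - log_PB(traj[t], traj[t + 1], y)
            total = total + w * residual ** 2
            weight_sum = weight_sum + w
    return total / weight_sum  # normalize by the sum of lambda weights
```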