ORDER-PRESERVING GFLOWNETS

Published as a conference paper at ICLR 2024

Yihang Chen, Section of Communication Systems, EPFL, Switzerland. yihang.chen@epfl.ch
Lukas Mauch, Sony Europe B.V., Stuttgart Laboratory 1, Germany. lukas.mauch@sony.com

ABSTRACT

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates with probabilities proportional to a given reward. However, GFlowNets can only be used with a predefined scalar reward, which can be either computationally expensive or not directly accessible, for example in multi-objective optimization (MOO) tasks. Moreover, to prioritize the identification of high-reward candidates, the conventional practice is to raise the reward to a higher exponent, whose optimal choice may vary across environments. To address these issues, we propose Order-Preserving GFlowNets (OP-GFNs), which sample with probabilities in proportion to a learned reward function that is consistent with a provided (partial) order on the candidates, thus eliminating the need for an explicit formulation of the reward function. We theoretically prove that the training process of OP-GFNs gradually sparsifies the learned reward landscape in single-objective maximization tasks. The sparsification concentrates on candidates of a higher hierarchy in the ordering, ensuring exploration at the beginning and exploitation towards the end of training. We demonstrate OP-GFNs' state-of-the-art performance in single-objective maximization (totally ordered) and multi-objective Pareto front approximation (partially ordered) tasks, including synthetic datasets, molecule generation, and neural architecture search.¹

1 INTRODUCTION

Generative Flow Networks (GFlowNets) are a novel class of generative machine learning models. As introduced by Bengio et al. (2021b), these models can generate composite objects $x \in \mathcal{X}$ with probabilities proportional to a given reward function, i.e., $p(x) \propto R(x)$. Notably, GFlowNets can represent distributions over composite objects such as sets and graphs. Their applications include the design of biological structures (Bengio et al., 2021a; Jain et al., 2022), Bayesian structure learning (Deleu et al., 2022; Nishikawa-Toomey et al., 2022), robust scheduling (Zhang et al., 2023a), and graph combinatorial problems (Zhang et al., 2023b). GFlowNets can also provide a route to unifying many generative models (Zhang et al., 2022), including hierarchical VAEs (Ranganath et al., 2016), normalizing flows (Dinh et al., 2014), and diffusion models (Song et al., 2021).

We investigate GFlowNets for combinatorial stochastic optimization of black-box functions. More specifically, we want to maximize a set of $D$ objectives over $\mathcal{X}$, $\boldsymbol{u}(x) \in \mathbb{R}^D$. In the objective space $\mathbb{R}^D$, we define Pareto dominance on vectors $\boldsymbol{u}, \boldsymbol{u}' \in \mathbb{R}^D$ by $\boldsymbol{u} \preceq \boldsymbol{u}' \iff \forall k,\ u_k \le u'_k$. We remark that $\preceq$ induces a total order on $\mathcal{X}$ for $D = 1$, and a partial order for $D > 1$.² We identify two problems:

1. GFlowNets require an explicit formulation of a scalar reward $R(x)$ that measures the global quality of an object $x$. In multi-objective optimization, where $D > 1$, GFlowNets cannot be applied directly, and $\boldsymbol{u}(x)$ has to be scalarized beforehand (Jain et al., 2023; Roy et al., 2023). Besides, evaluating the exact value of the objective function might be costly. For example, in the neural architecture search (NAS) task (Dong et al., 2021), evaluating an architecture's test accuracy requires training the network to completion.
¹ Our code is available at https://github.com/yhangchen/OP-GFN.
² A partial order is a homogeneous binary relation that is reflexive, transitive, and antisymmetric. A total order is a partial order in which any two elements are comparable.

2. For a scalar objective function $u(x)$, where $D = 1$, to prioritize the identification of candidates with high $u(x)$, GFlowNets typically operate on the exponentially scaled reward $R(x) = (u(x))^{\beta}$, termed GFN-$\beta$ (Bengio et al., 2021a). Note that the optimal $\beta$ heavily depends on the geometric landscape of $u(x)$. A small $\beta$ may hinder the exploitation of the maximal-reward candidates, since even a perfectly fitted GFlowNet will still have a high probability of sampling outside the areas that maximize $u(x)$. On the contrary, a large $\beta$ may hinder exploration, since the sampler encounters a highly sparse reward landscape in the early training stages.

In this work, we propose a novel training criterion for GFlowNets, termed Order-Preserving (OP) GFlowNets. The main difference to standard GFlowNets is that OP-GFNs are not trained to sample proportionally to an explicitly defined reward $R: \mathcal{X} \to \mathbb{R}^{+}$, but proportionally to a learned reward that is compatible with a provided (partial) order over $\mathcal{X}$ induced by $\boldsymbol{u}$. We summarize our contributions as follows:

1. We propose OP-GFNs for both single-objective maximization and multi-objective Pareto approximation, which require only the (partial-)ordering relations among candidates.
2. We empirically evaluate our method on the synthetic environment HyperGrid (Bengio et al., 2021a) and two real-world applications, NATS-Bench (Dong et al., 2021) and molecular design (Shen et al., 2023; Jain et al., 2023), to demonstrate its advantages in the diversity and the top reward (or the closeness to the Pareto front) of the generated candidates.
3. In the single-objective maximization problem, we theoretically prove that the learned reward is a piece-wise geometric progression with respect to the ranking by the objective function, and can efficiently assign high credit to the important substructures of the composite object.
4. We show that the learned order-preserving reward balances exploration in the early stages and exploitation in the later stages of training, by gradually sparsifying the reward function during training.

2 PRELIMINARIES

2.1 GFLOWNET

We give some essential definitions, following Section 3 of Bengio et al. (2021b). For a directed acyclic graph $G = (\mathcal{S}, \mathcal{A})$ with state space $\mathcal{S}$ and action space $\mathcal{A}$, we call the vertices states and the edges actions. Let $s_0 \in \mathcal{S}$ be the initial state, i.e., the only state with no incoming edges, and $\mathcal{X}$ the set of terminal states, i.e., the states with no outgoing edges. We note that Bengio et al. (2021a) allow terminal states with outgoing edges; this difference can be dealt with by augmenting every such state $s$ with a new terminal state $s^{\top}$ reached by a terminal action $s \to s^{\top}$. A complete trajectory is a sequence of transitions $\tau = (s_0 \to s_1 \to \dots \to s_n)$ going from the initial state $s_0$ to a terminal state $s_n$, with $(s_t \to s_{t+1}) \in \mathcal{A}$ for all $0 \le t \le n-1$. Let $\mathcal{T}$ be the set of complete trajectories. A trajectory flow is a nonnegative function $F: \mathcal{T} \to \mathbb{R}_{\ge 0}$. For any state $s$, we define the state flow $F(s) = \sum_{s \in \tau} F(\tau)$, and, for any edge $s \to s'$, the edge flow $F(s \to s') = \sum_{\tau = (\dots \to s \to s' \to \dots)} F(\tau)$.

The forward and backward transition probabilities are defined as $P_F(s'|s) := F(s \to s')/F(s)$ and $P_B(s|s') := F(s \to s')/F(s')$ for consecutive states $s, s'$. We define the normalizing factor $Z = \sum_{x \in \mathcal{X}} F(x)$. Suppose that a nontrivial nonnegative reward function $R: \mathcal{X} \to \mathbb{R}_{\ge 0}$ is given on the set of terminal states. GFlowNets (Bengio et al., 2021a) aim to approximate a Markovian flow $F(\cdot)$ on the graph $G$ such that $F(x) = R(x)$ for all $x \in \mathcal{X}$. In general, an objective optimizing for this equality cannot be minimized directly, because $F(x)$ is a sum over all trajectories leading to $x$. Several objectives, such as flow matching, detailed balance, trajectory balance, and sub-trajectory balance, have previously been proposed; we list their names, parametrizations, and respective constraints in Table 2.1. By Proposition 10 of Bengio et al. (2021b) for flow matching, Proposition 6 of Bengio et al. (2021b) for detailed balance, and Proposition 1 of Malkin et al. (2022) for trajectory balance, if the training policy has full support and the respective constraints are satisfied, the GFlowNet sampler does sample from the target distribution.

| Method | Parametrization (by $\theta$) | Constraint |
|---|---|---|
| Flow matching (FM, Bengio et al. (2021a)) | $F(s \to s')$ | $\sum_{(s'' \to s)\in\mathcal{A}} F(s'' \to s) = \sum_{(s \to s')\in\mathcal{A}} F(s \to s')$ |
| Detailed balance (DB, Bengio et al. (2021b)) | $F(s),\ P_F(s'\mid s),\ P_B(s\mid s')$ | $F(s)P_F(s'\mid s) = F(s')P_B(s\mid s')$ |
| Trajectory balance (TB, Malkin et al. (2022)) | $P_F(s'\mid s),\ P_B(s\mid s'),\ Z$ | $Z\prod_{t=1}^{n} P_F(s_t\mid s_{t-1}) = R(x)\prod_{t=1}^{n} P_B(s_{t-1}\mid s_t)$ |
| Sub-trajectory balance (subTB, Madan et al. (2022)) | $F(s),\ P_F(s'\mid s),\ P_B(s\mid s')$ | $F(s_{m_1})\prod_{t=m_1+1}^{m_2} P_F(s_t\mid s_{t-1}) = F(s_{m_2})\prod_{t=m_1+1}^{m_2} P_B(s_{t-1}\mid s_t)$ |

Table 2.1: GFlowNet objectives. We define the complete trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n = x)$ and its sub-trajectory $s_{m_1} \to \dots \to s_{m_2}$, $0 \le m_1 < m_2 \le n$.
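To make the trajectory-balance constraint in Table 2.1 concrete, the sketch below computes the TB loss for a single trajectory as a squared log-ratio. It is a minimal PyTorch sketch under our own naming and tensor-layout assumptions, not the released implementation.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """Squared log-ratio form of the TB constraint
    Z * prod_t P_F(s_t | s_{t-1}) = R(x) * prod_t P_B(s_{t-1} | s_t).

    log_pf_steps, log_pb_steps: 1-D tensors of per-step log-probabilities along
    one complete trajectory; log_Z and log_reward are scalars."""
    lhs = log_Z + log_pf_steps.sum()
    rhs = log_reward + log_pb_steps.sum()
    return (lhs - rhs) ** 2

# Toy usage with made-up numbers standing in for policy outputs.
log_Z = torch.tensor(0.0, requires_grad=True)
log_pf = torch.log(torch.tensor([0.5, 0.25, 0.5]))
log_pb = torch.log(torch.tensor([1.0, 0.5, 1.0]))
loss = trajectory_balance_loss(log_Z, log_pf, log_pb, torch.log(torch.tensor(2.0)))
loss.backward()  # gradients flow into log_Z (and, in practice, into the policy networks)
```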
2.2 MULTI-OBJECTIVE OPTIMIZATION

The multi-objective optimization (MOO) problem can be described as the desire to maximize a set of $D > 1$ objectives over $\mathcal{X}$. We summarize these objectives in the vector-valued function $\boldsymbol{u}(x) \in \mathbb{R}^D$. The Pareto front of an objective set $S$ is defined by $\text{Pareto}(S) := \{\boldsymbol{u} \in S : \nexists\, \boldsymbol{u}' \in S,\ \boldsymbol{u}' \ne \boldsymbol{u},\ \text{s.t.}\ \boldsymbol{u} \preceq \boldsymbol{u}'\}$, and the Pareto front of a sample set $X$ is defined by $\text{Pareto}(X) := \{x \in X : \nexists\, x' \ne x\ \text{s.t.}\ \boldsymbol{u}(x) \preceq \boldsymbol{u}(x')\}$. The image of $\boldsymbol{u}(x)$ is $\mathcal{U} := \{\boldsymbol{u}(x) : x \in \mathcal{X}\}$. Such MOO problems are typically solved by scalarization methods, i.e., preference or goal conditioning.

A GFN-based approach to multi-objective optimization (Jain et al., 2023), called preference conditioning (PC-GFNs), scalarizes the objective function using a set of preferences $w$: $R_w(x) := w^{\top}\boldsymbol{u}(x)$, where $\mathbf{1}^{\top}w = 1$ and $w_k \ge 0$. However, the problem with this scalarization is that only when the Pareto front is convex do we obtain the equivalence $\exists w,\ x \in \arg\max_{x} R_w(x) \iff x \in \text{Pareto}(\mathcal{X})$. In particular, the "$\Longleftarrow$" direction does not hold for a non-convex Pareto front, which means that some Pareto-front solutions never maximize any linear scalarization. Recently, Roy et al. (2023) proposed goal conditioning (GC-GFNs), a more controllable conditional model that can uniformly explore solutions along the entire Pareto front under challenging objective landscapes. A sample $x$ lies in the focus region $g$ if the cosine similarity between its objective vector $\boldsymbol{u}$ and the goal direction $d_g$ is above a threshold $c_g$, i.e., $g := \{\boldsymbol{u} \in \mathbb{R}^k : \cos(\boldsymbol{u}, d_g) := \frac{\boldsymbol{u}^{\top} d_g}{\|\boldsymbol{u}\|\,\|d_g\|} \ge c_g\}$. The reward function $R_g$ depends on the current goal $g$, so that the conditioned reward is non-zero only in the focus region, i.e., $R_g(x) := \|\boldsymbol{u}(x)\|_1 \cdot \mathbb{1}[\boldsymbol{u}(x) \in g]$, where $\mathbb{1}[\cdot]$ is the indicator function.
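As a concrete reference for the definitions above, the NumPy sketch below extracts the non-dominated set Pareto(X) from a batch of objective vectors and evaluates the preference-conditioned and goal-conditioned rewards. Function names and the batch layout are our own illustrative choices, not the cited implementations.

```python
import numpy as np

def pareto_mask(U):
    """Boolean mask of non-dominated rows of U (shape [N, D], maximization).
    A row u is dominated if some other row u' satisfies u <= u' with u != u'."""
    mask = np.ones(U.shape[0], dtype=bool)
    for i in range(U.shape[0]):
        others = np.delete(U, i, axis=0)
        dominated = np.any(np.all(others >= U[i], axis=1) &
                           np.any(others > U[i], axis=1))
        mask[i] = not dominated
    return mask

def preference_reward(u, w):
    """PC-GFN style linear scalarization R_w(x) = w^T u(x), with w on the simplex."""
    return float(np.dot(w, u))

def goal_reward(u, d_g, c_g):
    """GC-GFN style focus-region reward: ||u||_1 if cos(u, d_g) >= c_g, else 0."""
    cos = np.dot(u, d_g) / (np.linalg.norm(u) * np.linalg.norm(d_g) + 1e-12)
    return float(np.abs(u).sum()) if cos >= c_g else 0.0

# Toy usage on a random batch of 2-D objective vectors.
U = np.random.rand(8, 2)
front = U[pareto_mask(U)]
r_pc = preference_reward(U[0], w=np.array([0.3, 0.7]))
r_gc = goal_reward(U[0], d_g=np.array([1.0, 1.0]), c_g=0.9)
```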
3.1 OVERVIEW

We define some terminology here. Let $\mathcal{T}_B := \{(s^i_0 \to \dots \to s^i_{n_i} = x^i)\}_{i=1}^{B}$ be a batch of trajectories of size $B$, $X_B = \{x^i\}_{i=1}^{B}$ the set of terminal states reached by the trajectories in $\mathcal{T}_B$, and $S_B = \{\boldsymbol{u}(x^i)\}_{i=1}^{B}$ the (vector) objective set. We use $\boldsymbol{u}(x)$ to denote the single- or multi-objective function for a unified discussion, and $u(x)$ for the single-objective function only. We drop the subscript $B$ when the batch size is irrelevant. Let $R(x)$ be the predefined reward function used in previous GFlowNet methods, such as $R(x) = (u(x))^{\beta}$ when $D = 1$. We call $\beta$ the reward exponent and assume $\beta = 1$ unless mentioned otherwise. We also let $\hat{R}(x)$ be an undetermined reward function that we learn during training. In the GFlowNet framework, $\hat{R}(\cdot)$ is parametrized by the GFlowNet parameters $\theta$. The training of the order-preserving GFlowNet can be divided into two parts: 1) the order-preserving criterion, and 2) the Markov decision process (MDP) constraints.

1) Order preserving. We define a local labeling $y(\cdot; X)$ for $x \in X$ on a batch of terminal states (candidates) $X$. Let $\text{Pareto}(X)$ be the non-dominated candidates in $X$; the labeling is defined by $y(x; X) := \mathbb{1}[x \in \text{Pareto}(X)]$, which induces the labeling distribution $P_y(x|X)$. The reward $\hat{R}(\cdot)$ also induces a local distribution on the sample set $X$, denoted by $P(x|X, \hat{R})$. We have

$$P_y(x|X) := \frac{\mathbb{1}[x \in \text{Pareto}(X)]}{|\text{Pareto}(X)|}, \qquad P(x|X, \hat{R}) := \frac{\hat{R}(x)}{\sum_{x' \in X} \hat{R}(x')}, \quad x \in X. \quad (1)$$

Since we want $\hat{R}(\cdot)$ to keep the ordering of $\boldsymbol{u}$, we minimize the KL divergence. The order-preserving loss is, therefore,

$$L_{\mathrm{OP}}(X; \hat{R}) := \mathrm{KL}\big(P_y(\cdot|X)\,\|\,P(\cdot|X, \hat{R})\big). \quad (2)$$

In the TB parametrization, the state flow $F(\cdot)$ is not parametrized. By the trajectory-balance equality constraint in Table 2.1, the order-preserving reward of $x$ on a trajectory $\tau$ ending at $x$ is therefore

$$\hat{R}_{\mathrm{TB}}(x; \theta) := Z_{\theta} \prod_{t=1}^{n} \frac{P_F(s_t|s_{t-1}; \theta)}{P_B(s_{t-1}|s_t; \theta)}. \quad (3)$$

For the non-TB parametrizations, where the state flow is parametrized, we let $\hat{R}(x; \theta) = F(x; \theta)$.

2) MDP constraints. For the TB parametrization, where each trajectory is balanced, the MDP constraint loss is 0. For non-TB parametrizations, we introduce a hyperparameter $\lambda_{\mathrm{OP}}$ to balance the order-preserving loss and the MDP constraint loss. For ease of discussion, we set $\lambda_{\mathrm{OP}} = 1$ for the TB parametrization. We defer the detailed descriptions to Appendix C.
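The sketch below illustrates Equations (1)-(3) for the TB parametrization: the learned reward of each terminal state in a batch is read off from the trajectory-balance identity, normalized into a local distribution, and matched to the Pareto labeling with a KL term. It is a minimal PyTorch sketch under our own naming assumptions, not the released implementation.

```python
import torch

def op_loss_tb(log_Z, log_pf_sums, log_pb_sums, pareto_mask):
    """Order-preserving loss, Eq. (2), with the TB reward of Eq. (3).

    log_pf_sums, log_pb_sums: [B] per-trajectory sums of log P_F and log P_B.
    pareto_mask: [B] boolean, True for non-dominated candidates in the batch."""
    # Eq. (3): log R_hat_TB(x) = log Z + sum_t log P_F - sum_t log P_B.
    log_R_hat = log_Z + log_pf_sums - log_pb_sums
    # Eq. (1): local distribution induced by the learned reward.
    log_p = torch.log_softmax(log_R_hat, dim=0)
    # Eq. (1): labeling distribution, uniform on the Pareto set of the batch.
    p_y = pareto_mask.float() / pareto_mask.float().sum()
    # Eq. (2): KL(P_y || P) = sum_x P_y(x) * (log P_y(x) - log P(x)).
    return torch.sum(p_y * (torch.log(p_y + 1e-12) - log_p))

# Toy usage: batch of 4 trajectories, candidates 1 and 3 are non-dominated.
log_Z = torch.tensor(0.0, requires_grad=True)
log_pf = torch.randn(4, requires_grad=True)
log_pb = torch.randn(4)
loss = op_loss_tb(log_Z, log_pf, log_pb, torch.tensor([False, True, False, True]))
loss.backward()
```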
3.2 SINGLE-OBJECTIVE MAXIMIZATION

In single-objective maximization, i.e., $D = 1$, there exists a global ordering, or ranking, induced by $u(x)$ among all candidates. We consider the scenario where the GFlowNet is used to sample $\arg\max_{x \in \mathcal{X}} u(x)$. We assume the local labeling is defined on pairs, i.e., $X = (x, x')$. Therefore, from Equation (1), the labeling and local distributions are

$$P_y(x|X) = \frac{\mathbb{1}(u(x) > u(x')) + \mathbb{1}(u(x) \ge u(x'))}{2}, \qquad P(x|X, \hat{R}) = \frac{\hat{R}(x)}{\hat{R}(x) + \hat{R}(x')},$$

and consequently we can evaluate Equation (2) for $L_{\mathrm{OP}}(X = (x, x'); \hat{R})$. In the following, we provide some theoretical analysis of the single-objective maximization task. We defer experimental verification to Figure E.6 in Appendix E.2.3, where we visualize the average of the learned $\hat{R}(\cdot)$ over states with the same objective value.

1) Order preserving. We claim that the learned reward $\hat{R}(x_i)$ is a piece-wise geometric progression with respect to the ranking $i$. We first consider the special case: when the $u(x_i)$ are mutually different, $\log \hat{R}(x_i)$ is an arithmetic progression.

Proposition 1 (Mutually different). For $\{x_i\}_{i=0}^{n} \subseteq \mathcal{X}$, assume that $u(x_i) < u(x_j)$ for all $0 \le i < j \le n$. The order-preserving reward $\hat{R}(x) \in [1/\gamma, 1]$ is defined as the reward function that minimizes the order-preserving loss over neighboring pairs, $L^{\mathrm{N}}_{\mathrm{OP}}$, i.e.,

$$\hat{R}(\cdot) := \arg\min_{r,\ r(x) \in [1/\gamma, 1]} L^{\mathrm{N}}_{\mathrm{OP}}(\{x_i\}_{i=0}^{n}; r) := \arg\min_{r,\ r(x) \in [1/\gamma, 1]} \sum_{i=1}^{n} L_{\mathrm{OP}}(\{x_{i-1}, x_i\}; r). \quad (4)$$

We have $\hat{R}(x_i) = \gamma^{i/n - 1}$, $0 \le i \le n$, and $L^{\mathrm{N}}_{\mathrm{OP}}(\{x_i\}_{i=0}^{n}; \hat{R}) = n \log(1 + \gamma^{-1/n})$.

In practice, $\gamma$ is not a fixed value. Instead, $\gamma$ is driven to infinity when minimizing $L^{\mathrm{N}}_{\mathrm{OP}}$ with a variable $\gamma$. Combined with Proposition 1, we claim that the order-preserving loss gradually sparsifies the learned reward $\hat{R}(x)$ on mutually different objective values during training, with $\hat{R}(x_i) \to 0$ for $i \ne n$. We also consider the more general setting, where there exist $i \ne j$ such that $u(x_i) = u(x_j)$. We first give an informal statement in Proposition 2, and defer its formal version and proof to Proposition 5.

Proposition 2 (Informal). For $\{x_i\}_{i=0}^{n} \subseteq \mathcal{X}$, assume that $u(x_i) \le u(x_j)$ for all $0 \le i < j \le n$. Using the notation of Proposition 1, when $\gamma$ is sufficiently large, there exist $\alpha_\gamma, \beta_\gamma$, dependent on $\gamma$, such that $\hat{R}(x_{i+1}) = \alpha_\gamma \hat{R}(x_i)$ if $u(x_{i+1}) > u(x_i)$, and $\hat{R}(x_{i+1}) = \beta_\gamma \hat{R}(x_i)$ if $u(x_{i+1}) = u(x_i)$, for $0 \le i \le n-1$. Moreover, minimizing $L^{\mathrm{N}}_{\mathrm{OP}}$ with a variable $\gamma$ drives $\gamma \to \infty$, $\alpha_\gamma \to \infty$, and $\beta_\gamma \to 1$.

If we do not fix a positive lower bound $1/\gamma$, i.e., we let $\hat{R}(x) \in [0, 1]$, minimizing $L^{\mathrm{N}}_{\mathrm{OP}}$ defined in Equation (4) drives $\alpha_\gamma \to \infty$ and $\beta_\gamma \to 1$ as $\gamma \to \infty$. This indicates that $\hat{R}(x_j)/\hat{R}(x_i) \to 1$ if $u(x_i) = u(x_j)$, and $\hat{R}(x_j)/\hat{R}(x_i) \to \infty$ if $u(x_i) < u(x_j)$, as training goes on; i.e., $\hat{R}(\cdot)$ enlarges the relative gap between different objective values and assigns similar rewards to states with the same objective value. Since the GFlowNet sampler draws candidates with probabilities proportional to $\hat{R}(\cdot)$, it samples high-objective-value candidates with larger probability as training progresses. We remark that the sparsification of the learned reward and the training of the GFlowNet happen simultaneously. Therefore, exploration in the early training stages and exploitation on the later, sparsified reward $\hat{R}(\cdot)$ can both be achieved.

2) MDP constraints. In this part, we match the flow $F(\cdot)$ with $\hat{R}(\cdot)$, where $\hat{R}(\cdot)$ is fixed and learned as in Proposition 2. We consider the sequence prepend/append MDP (Shen et al., 2023).

Definition 1 (Sequence prepend/append MDP). In this MDP, states are strings, and actions prepend or append a symbol from an alphabet. The MDP generates strings of length $l$.

In the following proposition, we claim that matching the flow $F(\cdot)$ with a sufficiently trained OP-GFN (i.e., sufficiently large $\gamma$) assigns high flow values to the non-terminal states that lie on trajectories ending in maximal-reward candidates.

Proposition 3. In the MDP of Definition 1, consider the dataset $\{x_i, x'_n\}_{i=0}^{n}$ with $u(x_0) < u(x_1) < \dots < u(x_n) = u(x'_n)$. Define the important substring $s^{*}$ as the longest substring shared by $x_n$ and $x'_n$, and $s_k(x)$ as the set of $k$-length substrings of $x$. Following Proposition 5, let the order-preserving reward be $\hat{R}(\cdot) \in [1/\gamma, 1]$, with ratio $\alpha_\gamma$. We fix $P_B$ to be uniform and match the flow $F(\cdot)$ with $\hat{R}(\cdot)$ on terminal states. Then, when $\alpha_\gamma > 4$, we have $\mathbb{E}\,F(s^{*}) > \mathbb{E}\,F(s)$ for all $s \in s_{|s^{*}|}(x_n) \setminus \{s^{*}\}$, where the expectation is taken over the random positions of $s^{*}$ in $x_n, x'_n$.
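As a quick numerical sanity check of Proposition 1 (not part of the paper), the snippet below minimizes the neighboring-pair loss of Equation (4) over rewards constrained to $[1/\gamma, 1]$ and compares the result with the closed form $\hat{R}(x_i) = \gamma^{i/n-1}$. The sigmoid reparametrization of the constraint is our own convenience choice.

```python
import torch

n, gamma = 5, 100.0
log_gamma = torch.log(torch.tensor(gamma))

# Parametrize r(x_i) in [1/gamma, 1] via log r = (sigmoid(z) - 1) * log(gamma).
z = torch.zeros(n + 1, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

def pairwise_op_loss(log_r):
    # For a neighboring pair with u(x_{i-1}) < u(x_i), Eq. (2) reduces to
    # KL(P_y || P) = -log( r_i / (r_{i-1} + r_i) ); sum over i = 1..n.
    return -(log_r[1:] - torch.logaddexp(log_r[:-1], log_r[1:])).sum()

for _ in range(3000):
    opt.zero_grad()
    loss = pairwise_op_loss((torch.sigmoid(z) - 1.0) * log_gamma)
    loss.backward()
    opt.step()

with torch.no_grad():
    learned = torch.exp((torch.sigmoid(z) - 1.0) * log_gamma)
    closed_form = gamma ** (torch.arange(n + 1) / n - 1.0)
    print(learned)       # should approach the geometric progression below,
    print(closed_form)   # gamma^(i/n - 1), with loss near n * log(1 + gamma**(-1/n))
```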
3.3 EVALUATION

3.3.1 SINGLE-OBJECTIVE

GFlowNets for single-objective tasks are typically evaluated by measuring their ability to match the target distribution, e.g., by the Spearman correlation between $\log p(x)$ and $R(x)$ (Madan et al., 2022; Nica et al., 2022). However, our method learns a GFlowNet that samples terminal states proportionally to the learned reward $\hat{R}(x)$ rather than $R(x)$, which is not known a priori. Therefore, we focus only on evaluating the GFlowNet's ability to discover the maximal objective.

3.3.2 MULTI-OBJECTIVE

The evaluation of a multi-objective sampler is significantly more difficult. We assume the reference set $P := \{p_j\}_{j=1}^{|P|}$ is the set of true Pareto-front points, $S = \{s_i\}_{i=1}^{|S|}$ is the set of all generated candidates, and $\hat{P} = \text{Pareto}(S)$ is the set of non-dominated points in $S$. When the true Pareto front is unknown, we use a discretization of the extreme faces of the objective-space hypercube as $P$. Audet et al. (2021) summarize various performance indicators that measure the convergence and uniformity of the generated Pareto front. Specifically, we measure: 1) the convergence of all generated candidates $S$ to the true Pareto front $P$, by the averaged Hausdorff distance; 2) the convergence of the estimated Pareto front $\hat{P}$ to the true Pareto front $P$, by IGD+, hypervolume, and the R2 indicator; and 3) the uniformity of the estimated front $\hat{P}$ with respect to $P$, by PC-ent and the R2 indicator. We discuss these metrics in Appendix F.1. When computing the estimated Pareto front, we do not de-duplicate identical objective vectors.

4 SINGLE-OBJECTIVE EXPERIMENTS

4.1 HYPERGRID

We study the synthetic HyperGrid environment introduced by Bengio et al. (2021a). In this environment, the states $\mathcal{S}$ form a $D$-dimensional hypergrid with side length $H$:

$$\mathcal{S} = \{(s^1, \dots, s^D) \mid (H-1)\,s^d \in \{0, 1, \dots, H-1\},\ d = 1, \dots, D\},$$

and non-stop actions increment one coordinate of a state by $\frac{1}{H-1}$ without exiting the grid. The initial state is $(0, \dots, 0)$. For every state $s$, the terminal action is $s \to s^{\top} = x$. The objective at a terminal state $x = (x^1, \dots, x^D)$ is given by

$$u(x) = R_0 + 0.5 \prod_{d=1}^{D} \mathbb{I}\big[|x^d - 0.5| \in (0.25, 0.5]\big] + 2 \prod_{d=1}^{D} \mathbb{I}\big[|x^d - 0.5| \in (0.3, 0.4)\big],$$

where $\mathbb{I}$ is an indicator function and $R_0$ is a constant controlling the difficulty of exploration. The performance of standard GFlowNets, i.e., $R(x) = u(x)$, is sensitive to $R_0$: a large $R_0$ facilitates exploration but hinders exploitation, whereas a low $R_0$ facilitates exploitation but hinders exploration. OP-GFNs, in contrast, only use the pairwise order relation and are therefore independent of $R_0$.

We set $(D, H) = (2, 64), (3, 32)$, and $(4, 16)$, and compare TB and order-preserving TB (OP-TB). For $(D, H) = (2, 64)$, we plot the observed distribution of the 4000 most recently visited states in Figure E.4. We consider the following three ratios to measure exploration-exploitation: 1) #(distinctly visited states)/#(all states); 2) #(distinctly visited maximal states)/#(all maximal states); and 3) among the 4000 most recently visited states, #(distinct maximal states)/4000 (Figure E.5). A good sampling algorithm should have a small ratio 1) and large ratios 2) and 3). We confirm that TB's performance is sensitive to the choice of $R_0$, and observe that OP-TB recovers all maximal areas more efficiently and samples maximal candidates with higher probability after visiting fewer distinct candidates. Detailed discussions are in Appendix E.2.2. To conclude, OP-GFNs can balance exploration and exploitation without the selection of $R_0$.
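The HyperGrid objective above (with coordinates already normalized to $[0, 1]$) can be transcribed directly; the function and argument names below are ours, and $R_0 = 0.1$ is only an example value.

```python
import numpy as np

def hypergrid_objective(x, R0=0.1):
    """HyperGrid objective u(x) for a terminal state x with coordinates in [0, 1]."""
    dist = np.abs(np.asarray(x, dtype=float) - 0.5)
    plateau = np.all((dist > 0.25) & (dist <= 0.5))   # wide plateau, adds 0.5
    peak = np.all((dist > 0.3) & (dist < 0.4))        # narrow peaks, add another 2
    return R0 + 0.5 * float(plateau) + 2.0 * float(peak)

# Example on a 2-D grid with H = 64: points near (but not at) the corners hit the peaks.
H = 64
x = np.array([9, 54]) / (H - 1)    # approximately (0.14, 0.86), |x - 0.5| ~ 0.36
print(hypergrid_objective(x))      # R0 + 0.5 + 2 = 2.6 for R0 = 0.1
```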
4.2 MOLECULAR DESIGN

We study several molecular design environments (Bengio et al., 2021a), including Bag, TFBind8, TFBind10, QM9, and sEH; detailed descriptions are in Appendix E.3.1. We use the sequence formulation of Shen et al. (2023) for molecule graph generation, i.e., the sequence prepend/append MDP of Definition 1. We define the optimal candidates in Bag as the $x$ with objective value 30, in SIX6, PHO4, and QM9 as the top 0.5% of $x$ ranked by objective value, and in sEH as the top 0.1% of $x$ ranked by objective value. The total number of such candidates for Bag, SIX6, PHO4, QM9, and sEH is 3160, 328, 5764, 805, and 34013, respectively.

We consider previous GFN methods and reward-maximization methods as baselines. Previous GFN methods include TB, DB, subTB, maximum entropy (MaxEnt, Malkin et al. (2022)), and substructure-guided trajectory balance (GTB, Shen et al. (2023)). For reward-maximization methods, we consider a widely used sampling-based method in the molecule domain, Markov Molecular Sampling (MARS, Xie et al. (2021)), and RL-based methods, including actor-critic (A2C, Mnih et al. (2016)), Soft Q-Learning (SQL, Hou et al. (2020)), and proximal policy optimization (PPO, Schulman et al. (2017)). We run the experiments over 3 seeds and plot the mean and variance of the objective values of the top-100 candidates ranked by $u(x)$, as well as the number of optimal candidates found among all generated candidates, in Figure 4.1. We find that the order-preserving method outperforms all baselines, both in the number of distinct optimal candidates found and in the average top-k performance.

Figure 4.1: Molecular design (y-axes: # optimal found and top-k performance; x-axis: round). In the environments Bag, QM9, sEH, TFBind8, and TFBind10, we test our algorithm (OP-TB) against previous GFN methods (MaxEnt, TB, DB, subTB, GTB) and (RL-)sampling methods (MARS, A2C, SQL, PPO).
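The sequence prepend/append MDP of Definition 1, used in the molecule tasks above, is simple enough to sketch directly: the class below enumerates children (prepend or append a symbol) and parents (strip the first or last symbol), which is all a GFlowNet needs to roll trajectories forward or backward. Naming is ours, and this is only an illustration of the state space, not the experiments' environment code.

```python
class PrependAppendMDP:
    """Sequence prepend/append MDP (Definition 1): states are strings over an
    alphabet, actions prepend or append one symbol, strings of length l are terminal."""

    def __init__(self, alphabet, length):
        self.alphabet, self.length = alphabet, length
        self.s0 = ""  # initial state: the empty string

    def children(self, s):
        if len(s) == self.length:
            return []  # terminal state: no outgoing actions
        return sorted({a + s for a in self.alphabet} | {s + a for a in self.alphabet})

    def parents(self, s):
        if s == self.s0:
            return []
        # s can be reached by having prepended its first symbol or appended its last one.
        return sorted({s[1:], s[:-1]})

mdp = PrependAppendMDP(alphabet="ACGT", length=3)
print(mdp.children("AC"))  # 8 strings: prepend or append one of A, C, G, T
print(mdp.parents("ACT"))  # ['AC', 'CT']: each length-l string has 2^(l-1) complete trajectories
```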
4.3 NEURAL ARCHITECTURE SEARCH

4.3.1 NAS ENVIRONMENT

We study the neural architecture search environment NATS-Bench (Dong et al., 2021), which includes three datasets: CIFAR-10, CIFAR-100, and ImageNet-16-120. We choose the topology search space in NATS-Bench. Neural architecture search can be regarded as a sequence generation problem, where each sequence of neural network operations uniquely determines an architecture. In the MDP, each forward action fills any empty position in the sequence, starting from the empty sequence. For a sequence $x \in \mathcal{X}$, the objective function $u_T(x)$ is the test accuracy of the corresponding architecture with the weights at the $T$-th epoch of its standard training pipeline. To measure the cost of computing $u_T(\cdot)$, we introduce the simulated train-and-test (T&T) time, defined as the time to train the architecture to epoch $T$ and evaluate its test accuracy at epoch $T$. NATS-Bench provides APIs for $u_T(\cdot)$ and its T&T time for $T \le 200$. Following Dong et al. (2021), when training the GFlowNet, we use the test accuracy at epoch 12 ($u_{12}(\cdot)$) as the training objective; when evaluating the candidates, we use the test accuracy at epoch 200 ($u_{200}(\cdot)$) as the testing objective. We remark that $u_{12}(\cdot)$ is a proxy for $u_{200}(\cdot)$ with lower T&T time, and the global ordering induced by $u_{12}(\cdot)$ is likewise a proxy for the global ordering induced by $u_{200}(\cdot)$. OP-GFNs access only the ordering of $u_{12}(\cdot)$, ignoring the unnecessary information in $u_{12}(\cdot)$'s exact value.

4.3.2 EXPERIMENTAL DETAILS

We train the GFlowNet with the multi-trial sampling procedure described in Appendix E.4.2, and optionally use the backward KL regularization (-KL) and trajectory augmentation (-AUG) introduced in Appendix E.1. To monitor the training process, in each training round we record the architecture with the highest training objective and the accumulated T&T time up to that round. We terminate training when the accumulated T&T time reaches a threshold of 50000, 100000, and 200000 seconds for CIFAR-10, CIFAR-100, and ImageNet-16-120, respectively. We adopt random search (RANDOM) as a baseline, and compare our results against previous multi-trial sampling methods: 1) evolutionary strategies, e.g., REA (Real et al., 2019); 2) reinforcement learning (RL)-based methods, e.g., REINFORCE (Williams, 1992); and 3) hyperparameter optimization (HPO) methods, e.g., BOHB (Falkner et al., 2018); their experimental settings are described in Appendix E.4.2. To evaluate the algorithms, we plot the average accuracy (at epochs 12 and 200) of the recorded sequence of architectures against the recorded sequence of accumulated T&T times, over 200 random seeds.³

³ Since different runs sample different sequences of architectures, and different architectures have different T&T times, each run's sequence of time coordinates may not be uniformly spaced. We first linearly interpolate each run's sequence of points and then calculate the mean and variance at fixed reference points.

We report the results for trajectory balance (TB) and its order-preserving variant (OP-TB) in Figure 4.2,⁴ and defer the results of non-TB methods to Appendix E.4.5, which includes ablation studies of (OP-)DB, (OP-)FM, (OP-)subTB, and the hyperparameter $\lambda_{\mathrm{OP}}$. We observe that OP-non-TB methods achieve performance gains similar to OP-TB, which validates the effectiveness and generality of the order-preserving method. Moreover, we compare OP-TB with the GFN-$\beta$ algorithm in Appendix E.4.4, where $\beta = 4, 8, 16, 32, 64, 128$, and plot the results in Figure E.8. We observe that OP-GFN outperforms the GFN-$\beta$ method by a significant margin. We conclude that order-preserving GFlowNets consistently improve over the previous baselines in both the training and testing objectives, especially in the early training stages. Besides, backward KL regularization and backward trajectory augmentation also contribute positively to sampling efficiency. Finally, once we have a trained GFlowNet sampler, we can also use the learned order-preserving reward as a proxy to further boost sampling efficiency; see Appendix E.4.3.

⁴ We remark that the first 64 candidates of the GFN methods are generated by a random policy. The point at which we observe a sudden performance increase of the GFN methods indicates the start of training.

Figure 4.2: Multi-trial training of a GFlowNet sampler. Best test accuracy at epochs 12 and 200 (y-axes) versus estimated wall-clock time (x-axis) on CIFAR-10, CIFAR-100, and ImageNet-16-120, for the random baseline (RANDOM), GFlowNet methods (TB, OP-TB, OP-TB-KL, OP-TB-KL-AUG), and other multi-trial algorithms (REA, BOHB, REINFORCE).

5 MULTI-OBJECTIVE EXPERIMENTS

5.1 HYPERGRID

We study the HyperGrid environment of Section 4.1 with $(D, H) = (2, 32)$ and four normalized objectives: branin, currin, shubert, and beale; see Appendix F.2 for details. All objectives are normalized to fall between 0 and 1. The true Pareto front of the HyperGrid environment can be obtained explicitly by enumerating all states.
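Appendix F.2 specifies the exact normalized objectives used here (not included in this excerpt). As a plausible stand-in for two of them, the sketch below maps grid coordinates to the unit square and applies the standard Branin and Currin test functions with a min-max normalization; the orientation (negating Branin so that larger is better) and the normalization are our assumptions, not the paper's definitions.

```python
import numpy as np

def branin(x1, x2):
    # Standard Branin function on [-5, 10] x [0, 15], inputs given in [0, 1]^2.
    t1, t2 = -5 + 15 * x1, 15 * x2
    b, c = 5.1 / (4 * np.pi ** 2), 5 / np.pi
    return (t2 - b * t1 ** 2 + c * t1 - 6) ** 2 + 10 * (1 - 1 / (8 * np.pi)) * np.cos(t1) + 10

def currin(x1, x2):
    # Standard Currin exponential function on [0, 1]^2.
    factor = 1 - np.exp(-1 / (2 * x2 + 1e-12))
    num = 2300 * x1 ** 3 + 1900 * x1 ** 2 + 2092 * x1 + 60
    den = 100 * x1 ** 3 + 500 * x1 ** 2 + 4 * x1 + 20
    return factor * num / den

def normalized_objectives(grid_state, H=32):
    """Map an integer grid state in {0,...,H-1}^2 to [0, 1]^2 and return
    min-max-normalized (negated) Branin and Currin values."""
    x1, x2 = np.asarray(grid_state) / (H - 1)
    xs = np.linspace(0, 1, H)
    X1, X2 = np.meshgrid(xs, xs)
    B, C = -branin(X1, X2), currin(X1, X2)  # Branin is a minimization objective, so negate it
    norm = lambda v, V: (v - V.min()) / (V.max() - V.min())
    return norm(-branin(x1, x2), B), norm(currin(x1, x2), C)

print(normalized_objectives((8, 25)))
```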
The training procedures are described in Appendix F.2. We generate 1280 candidates with the learned GFlowNet sampler as the evaluation set $S$, and report the metrics in Table F.1. To visualize the sampler, we plot all objective vectors and the true Pareto front, as well as the first 128 generated objective vectors and their estimated Pareto front, in the objective spaces $[0,1]^2$ and $[0,1]^3$ in Figure F.1. We observe that our sampler achieves a better approximation of the Pareto front (i.e., almost zero IGD+ and smaller $d_H$) and better uniformity along it (i.e., higher PC-ent), especially for non-convex Pareto fronts such as currin-shubert. We also plot the learned reward distributions of OP-GFNs and compare them with the indicator functions of the true Pareto-front solutions in Figure 5.1, where we observe that OP-GFNs learn a highly sparse reward function that concentrates on the true Pareto solutions, outperforming PC-GFNs.

Figure 5.1: Reward distribution. For each objective pair (branin-currin, branin-shubert, branin-beale, currin-shubert, currin-beale, shubert-beale), we plot the indicator function of the true Pareto-front solutions and the learned reward distributions of the OP-GFNs and PC-GFNs.

5.2 N-GRAM

The synthetic sequence design task n-gram, proposed by Stanton et al. (2022), is to generate sequences of a fixed maximum length $L = 36$. The vocabulary (action set) used to construct the sequence has size 21: 20 characters and a special token that ends the sequence. The objectives are defined by the number of occurrences of a given set of n-grams in a sequence $x$. We consider unigrams and bigrams in our experiments and summarize the objectives in Table F.2. In this multi-objective setting, we use a replay buffer to help stabilize training (Roy et al., 2023, Appendix C.1). Specifically, instead of a purely on-policy update, we push the online-sampled batch into the replay buffer and immediately sample a batch of the same size from it. We defer the detailed experimental settings to Appendix F.3; the results are summarized in Table F.3. We observe that OP-GFNs achieve better performance on most of the tasks.
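For reference, two of the indicators reported in Tables F.1 and F.3 can be computed in a few lines. The IGD+ and 2-D hypervolume functions below follow the standard definitions for maximization problems (Ishibuchi et al., 2015; Fonseca et al., 2006) and are our own simplified versions, not the evaluation code used in the paper.

```python
import numpy as np

def igd_plus(reference_front, candidates):
    """IGD+ for maximization: average, over reference Pareto points p, of the distance
    to the closest candidate s, counting only shortfalls max(p_k - s_k, 0)."""
    P = np.asarray(reference_front)[:, None, :]   # [|P|, 1, D]
    S = np.asarray(candidates)[None, :, :]        # [1, |S|, D]
    d_plus = np.sqrt((np.maximum(P - S, 0.0) ** 2).sum(-1))
    return d_plus.min(axis=1).mean()

def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Hypervolume of a 2-D non-dominated set (maximization) w.r.t. a reference point
    that is dominated by every point of the front."""
    pts = sorted(front, key=lambda p: -p[0])      # first objective descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return hv

front = np.array([[0.9, 0.2], [0.6, 0.6], [0.2, 0.9]])
print(igd_plus(front, np.array([[0.8, 0.1], [0.5, 0.5]])))
print(hypervolume_2d(front))  # 0.9*0.2 + 0.6*0.4 + 0.2*0.3 = 0.48
```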
5.3 DNA SEQUENCE GENERATION

An example of a real-world scenario where the GFlowNet graph takes the form of a tree is the design of DNA aptamers, single-stranded nucleotide sequences widely employed in biological polymer design (Zhou et al., 2017; Yesselman et al., 2019). We generate DNA sequences of length 30 by adding one nucleobase ("A", "C", "T", "G") at a time. We consider three objectives: (1) energy, the free energy of the secondary structure calculated by the software NUPACK (Zadeh et al., 2011); (2) pins, the DNA hairpin index; and (3) pairs, the number of base pairs. All objectives are normalized to lie between 0 and 1. The experimental settings are detailed in Appendix F.4. We report the metrics in Table F.4 and plot the objective vectors in Figure F.2. We conclude that OP-GFNs achieve similar or better performance than preference conditioning, especially in the diversity of the estimated Pareto front.

5.4 FRAGMENT-BASED MOLECULE GENERATION

Fragment-based molecule generation is a four-objective molecular generation task: (1) qed, the well-known drug-likeness heuristic QED (Bickerton et al., 2012); (2) seh, the sEH binding-energy prediction of a pre-trained, publicly available model (Bengio et al., 2021a); (3) sa, a standard heuristic of synthetic accessibility; and (4) mw, a molecular-weight target-region penalty that favors molecules with a weight under 300. All objectives are normalized to lie between 0 and 1. We compare our OP-GFNs to both preference-conditioned (PC) and goal-conditioned (GC) GFNs. To stabilize the training of OP-GFNs and GC-GFNs, we use the same replay buffer as in Section 5.2. The detailed experimental settings are in Appendix F.5. For evaluation, we sample 64 candidates per round, over 50 rounds, using the trained sampler. We plot the estimated Pareto front in Figure 5.2, and defer the full results to Figure F.3 and Table F.5. We conclude that OP-GFNs achieve comparable or better performance than condition-based GFNs, without requiring scalarization in advance.

Figure 5.2: Fragment-based molecule generation. For each objective pair (seh-qed, seh-sa, seh-mw, qed-sa, qed-mw, sa-mw), we plot the estimated Pareto fronts of the samples generated by PC-GFN, GC-GFN, and OP-GFN in $[0,1]^2$; the x- and y-axes are the first and second objective of each pair, respectively.

6 CONCLUSION

In this paper, we propose order-preserving GFlowNets, which sample composite objects with probabilities in proportion to a learned reward function that is consistent with a provided (partial) order. We theoretically prove that OP-GFNs learn to sample optimal candidates exponentially more often than non-optimal candidates. The main contribution is that our method does not require an explicit scalar reward function a priori, and can be used directly in MOO tasks without scalarization. Also, when evaluating the objective function's value is costly but an ordering relation, such as a pairwise comparison, is feasible, OP-GFNs can efficiently reduce the training cost. We will continue exploring order-based sampling methods, especially in MOO tasks. For example, we currently resample from the replay buffer to ensure that the training of OP-GFNs does not collapse to part of the Pareto front. In the future, we hope to introduce more controllable guidance to ensure the diversity of OP-GFN sampling.

REFERENCES

Charles Audet, Jean Bigeon, Dominique Cartier, Sébastien Le Digabel, and Ludovic Salomon. Performance indicators in multiobjective optimization. European Journal of Operational Research, 292(2):397-422, 2021.

Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450-1454, 2016.
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. Neural Information Processing Systems (Neur IPS), 2021a. Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward Hu, Mo Tiwari, and Emmanuel Bengio. GFlow Net foundations. ar Xiv preprint 2111.09266, 2021b. G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90 98, 2012. L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009. Lars Buesing, Nicolas Heess, and Theophane Weber. Approximate inference in discrete distributions with monte carlo tree search and value functions. In International Conference on Artificial Intelligence and Statistics, pp. 624 634. PMLR, 2020. Carlos A Coello Coello and Margarita Reyes Sierra. A study of the parallelization of a coevolutionary multi-objective evolutionary algorithm. In MICAI 2004: Advances in Artificial Intelligence: Third Mexican International Conference on Artificial Intelligence, Mexico City, Mexico, April 26-30, 2004. Proceedings 3, pp. 688 697. Springer, 2004. Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pp. 518 528. PMLR, 2022. Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516, 2014. Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. ar Xiv preprint ar Xiv:2001.00326, 2020. Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. Nats-bench: Benchmarking nas algorithms for architecture topology and size. IEEE transactions on pattern analysis and machine intelligence, 44(7):3634 3646, 2021. Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pp. 1437 1446. PMLR, 2018. William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061 3071. PMLR, 2020. Carlos M Fonseca, Luís Paquete, and Manuel López-Ibánez. An improved dimension-sweep algorithm for the hypervolume indicator. In 2006 IEEE international conference on evolutionary computation, pp. 1157 1163. IEEE, 2006. Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning (ICML), 2017. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018. Published as a conference paper at ICLR 2024 Michael Pilegaard Hansen and Andrzej Jaszkiewicz. Evaluating the quality of approximations to the non-dominated set. Citeseer, 1994. Zhimin Hou, Kuangen Zhang, Yi Wan, Dongyu Li, Chenglong Fu, and Haoyong Yu. Off-policy maximum entropy reinforcement learning: Soft actor-critic with advantage weighted mixture policy (sac-awmp). ar Xiv preprint ar Xiv:2002.02829, 2020. 
Hisao Ishibuchi, Hiroyuki Masuda, Yuki Tanigaki, and Yusuke Nojima. Modified distance calculation in generational distance and inverted generational distance. In Evolutionary Multi-Criterion Optimization: 8th International Conference, EMO 2015, Guimarães, Portugal, March 29 April 1, 2015. Proceedings, Part II 8, pp. 110 125. Springer, 2015. Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F.P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological sequence design with GFlow Nets. International Conference on Machine Learning (ICML), 2022. Moksh Jain, Sharath Chandra Raparthy, Alex Hernández-Garcia, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective gflownets. In International Conference on Machine Learning, pp. 14631 14653. PMLR, 2023. Minsu Kim, Joohwan Ko, Dinghuai Zhang, Ling Pan, Taeyoung Yun, Woochang Kim, Jinkyoo Park, and Yoshua Bengio. Learning to scale logits for temperature-conditional gflownets. ar Xiv preprint ar Xiv:2310.02823, 2023a. Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, and Jinkyoo Park. Local search gflownets. ar Xiv preprint ar Xiv:2310.02710, 2023b. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Salem Lahlou, Joseph D Viviano, and Victor Schmidt. torchgfn: A pytorch gflownet library. ar Xiv preprint ar Xiv:2305.14594, 2023. Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. ar Xiv preprint ar Xiv:1806.09055, 2018. Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning gflownets from partial episodes for improved convergence and stability. ar Xiv preprint ar Xiv:2209.12782, 2022. Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in gflownets. ar Xiv preprint ar Xiv:2201.13259, 2022. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Grégoire Montavon, Matthias Rupp, Vivekanand Gobre, Alvaro Vazquez-Mayagoitia, Katja Hansen, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole von Lilienfeld. Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics, 15(9): 095003, 2013. URL http://stacks.iop.org/1367-2630/15/i=9/a=095003. Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. Neural Information Processing Systems (Neur IPS), 2017. Andrei Cristian Nica, Moksh Jain, Emmanuel Bengio, Cheng-Hao Liu, Maksym Korablyov, Michael M Bronstein, and Yoshua Bengio. Evaluating generalization in gflownets for molecule design. In ICLR2022 Machine Learning for Drug Discovery, 2022. Mizu Nishikawa-Toomey, Tristan Deleu, Jithendaraa Subramanian, Yoshua Bengio, and Laurent Charlin. Bayesian learning of causal structure and mechanisms with gflownets and variational bayes. ar Xiv preprint ar Xiv:2211.02763, 2022. Published as a conference paper at ICLR 2024 Ling Pan, Dinghuai Zhang, Aaron Courville, Longbo Huang, and Yoshua Bengio. Generative augmented flow networks. 
ar Xiv preprint ar Xiv:2210.03308, 2022. Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better training of gflownets with local credit and incomplete trajectories. ar Xiv preprint ar Xiv:2302.01687, 2023. Rajesh Ranganath, Dustin Tran, and David Blei. Hierarchical variational models. In International conference on machine learning, pp. 324 333. PMLR, 2016. Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, volume 33, pp. 4780 4789, 2019. Jarrid Rector-Brooks, Kanika Madan, Moksh Jain, Maksym Korablyov, Cheng-Hao Liu, Sarath Chandar, Nikolay Malkin, and Yoshua Bengio. Thompson sampling for improved exploration in gflownets. ar Xiv preprint ar Xiv:2306.17693, 2023. Julien Roy, Pierre-Luc Bacon, Christopher Pal, and Emmanuel Bengio. Goal-conditioned gflownets for controllable multi-objective molecular design. ar Xiv preprint ar Xiv:2306.04620, 2023. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Oliver Schutze, Xavier Esquivel, Adriana Lara, and Carlos A Coello Coello. Using the averaged hausdorff distance as a performance measure in evolutionary multiobjective optimization. IEEE Transactions on Evolutionary Computation, 16(4):504 522, 2012. Max W Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tommaso Biancalani. Towards understanding and improving gflownet training. ar Xiv preprint ar Xiv:2305.07170, 2023. Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. Nasbench-301 and the case for surrogate benchmarks for neural architecture search. ar Xiv preprint ar Xiv:2008.09777, 2020. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=Px TIG12RRHS. Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating bayesian optimization for biological sequence design with denoising autoencoders. In International Conference on Machine Learning, pp. 20459 20478. PMLR, 2022. David Allen Van Veldhuizen. Multiobjective evolutionary algorithms: classifications, analyses, and new innovations. Air Force Institute of Technology, 1999. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pp. 5 32, 1992. Yutong Xie, Chence Shi, Hao Zhou, Yuwei Yang, Weinan Zhang, Yong Yu, and Lei Li. MARS: Markov molecular sampling for multi-objective drug discovery. International Conference on Learning Representations (ICLR), 2021. Joseph D Yesselman, Daniel Eiler, Erik D Carlson, Michael R Gotrik, Anne E d Aquino, Alexandra N Ooms, Wipapat Kladwang, Paul D Carlson, Xuesong Shi, David A Costantino, et al. Computational design of three-dimensional rna structure and function. Nature nanotechnology, 14(9):866 873, 2019. Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nasbench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning, pp. 7105 7114. PMLR, 2019. 
Published as a conference paper at ICLR 2024 Joseph N Zadeh, Conrad D Steenberg, Justin S Bois, Brian R Wolfe, Marshall B Pierce, Asif R Khan, Robert M Dirks, and Niles A Pierce. Nupack: Analysis and design of nucleic acid systems. Journal of computational chemistry, 32(1):170 173, 2011. David W Zhang, Corrado Rainone, Markus Peschl, and Roberto Bondesan. Robust scheduling with gflownets. ar Xiv preprint ar Xiv:2302.05446, 2023a. Dinghuai Zhang, Ricky TQ Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models with gflownets. ar Xiv preprint ar Xiv:2209.02606, 2022. Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron Courville, Yoshua Bengio, and Ling Pan. Let the flows tell: Solving graph combinatorial optimization problems with gflownets. ar Xiv preprint ar Xiv:2305.17010, 2023b. Dinghuai Zhang, Ling Pan, Ricky TQ Chen, Aaron Courville, and Yoshua Bengio. Distributional gflownets with quantile flows. ar Xiv preprint ar Xiv:2302.05793, 2023c. Wenhu Zhou, Runjhun Saran, and Juewen Liu. Metal sensing by dna. Chemical reviews, 117(12): 8272 8325, 2017. Published as a conference paper at ICLR 2024 A OVERVIEW OF APPENDIX We give a brief overview of the appendix here. Appendix B: we discuss related work on reinforcement learning, discrete GFlow Nets and NAS Benchmark. Appendix C: we incorporate order-preserving loss with GFlow Nets criteria other than trajectory balance. Appendix D: we provide the formal descriptions and missing proofs of Proposition 1, Proposition 2 and Proposition 3 in Section 3.2. Appendix E Complete single-objective experiments. Appendix E.1 Preliminaries and tricks in the single-objective experiments. Appendix E.2, Hyper Grid Experiments: In Appendix E.2.1, we validate the effectiveness of backward KL regularization proposed in Appendix E.1. In Appendix E.2.2, we provide the ablation study of R0. In Appendix E.2.3, we plot the learned reward distribution. Appendix E.3, Molecule Experiments: we provide the implementation details for each environment and present complete experimental results. Appendix E.4, NAS Experiments: In Appendix E.4.1, we give a complete description of the NAS environment. In Appendix E.4.2, we provide the implementation details. In Appendix E.4.3, we provide experimental results on boosting the sampler. In Appendix E.4.4, we provide the ablation study on GFN-β methods, KL regularization hyperparameter λKL, and the size of the randomly generated dataset at initialization. In Appendix E.4.5, we provide the experimental results on OP methods in FM, DB and sub TB. Appendix F: Complete multi-objective experiments. In Appendix F.1, we discuss multiple multi-objective evaluation metrics. In Appendix F.2, Appendix F.3, Appendix F.4, Appendix F.5, we conduct experiments on Hyper Grid, n-gram, DNA sequence generation, and fragment-based molecule generation separately. B RELATED WORK Reinforcement Learning GFlow Nets are trained to sample proportionally the reward rather than maximize it in standard RL. However, on tree-structured DAGs (autoregressive generation) are equivalent to RL with appropriate entropy regularization or soft Q-learning and control as inference (Buesing et al., 2020; Haarnoja et al., 2017; 2018). The experiments and theoretical explanation of Bengio et al. (2021a) show how standard RL-based methods can fail in the general DAG case, while GFlow Nets can handle it well. Signal propagation over sequences of several actions in trajectory balance is also related to losses used in RL computed on subtrajectories (Nachum et al., 2017). 
However, compared with our proposed OP-GFNs, previous RL baselines are limited in their application to the multi-objective optimization and are inferior in the diversity of the solutions. Discrete GFlow Nets GFlow Nets were first formulated as a reinforcement learning algorithm (Bengio et al., 2021a), with discrete state and action spaces, that trains a sequential sampler that samples the terminal states with probabilities in proportion to a given reward by the flow matching objective. Bengio et al. (2021b) provides the theoretical foundations of GFlow Nets, based on flow networks defined on MDPs, and proposes the detailed balance loss that bypasses the need to sum the flows over large sets of children and parents. Later, trajectory balance (Malkin et al., 2022), sub-trajectory balance (Madan et al., 2022), augmented flow (Pan et al., 2022), forward-looking (Pan et al., 2023), quantile matching (Zhang et al., 2023c), local search (Kim et al., 2023b) were proposed to improve the credit assignments along trajectories. NAS Benchmark The first tabular NAS benchmark to be released was NAS-Bench-101 (Ying et al., 2019). This benchmark consists of 423,624 architectures trained on CIFAR-10. NAS-Bench201 (Dong & Yang, 2020) is another popular tabular NAS benchmark. The cell-based search space consists of a DAG where each edge can take on operations. The number of non-isomorphic architectures is 6,466 and all are trained on CIFAR-10, CIFAR-100, and Image Net-16-120. NATSBench (Dong et al., 2021) is an extension of NAS-Bench-201, and provides an algorithm-agnostic Published as a conference paper at ICLR 2024 benchmark for most up-to-date NAS algorithms. The search spaces in NATS-Bench includes both architecture topology and size. The DARTS (Liu et al., 2018) search space with CIFAR-10, consisting of 1018 architectures, but is not queryable. 60 000 of the architectures were trained and used to create NAS-Bench-301 (Siems et al., 2020), the first surrogate NAS benchmark. Exploration in GFlow Nets . The exploration and exploitation strategy in GFlow Nets has been analyzed recently, and we will discuss the difference here. . Rector-Brooks et al. (2023) demonstrates how Thompson sampling with GFlow Nets allows for improved exploration and optimization efficiency in GFlow Nets. We point out the key difference in the exploration strategy is that, Rector-Brooks et al. (2023) encourages the exploration by bootstrapping K different policies, while in the early stages of OP-GFNs, the learned reward is almost uniform, which naturally facilitates the exploration. As the training goes on, the learned reward gets sparser, and the exploration declines and exploitation arises. We remark that the idea of Thompson sampling can be used in the exploitation stages of OP-GFNs to further encourage the exploration in the latter stages of training. Temperature-Conditional GFlow Nets . Zhang et al. (2023a); Kim et al. (2023a) propose the Temperature-Conditional GFlow Nets (TC-GFNs) to learn the generative distribution with any given temperature. Zhang et al. (2023a) conditions the policy networks on the temperature, and Kim et al. (2023a) directly scales probability logits regarding the temperature. We point out some critical differences to OP-GFN: 1) In TC-GFNs, a suited β s prior must still be chosen. while OP-GFNs do not require such a choice. 2) TC-GFNs learn to match Rβ(x) for all β, while OP-GFNs just learn to sample with the correct ordering statistics. 
However, using these few statistics, OP-GFNs still achieve competitive results in both single and multi-objective optimization. 3) TC-GFNs require the scalar reward function, while OP-GFNs can be directly used in multi-objective optimization. C ORDER-PRESERVING AS A PLUGIN SOLVER In the next, we discuss how we can integrate the loss LOP into existing (non-TB) training criteria. In this case, we will introduce a hyperparameter λOP to balance the order-preserving loss and the MDP constraint loss. Although λOP is dependent on the MDP structure and the objective function, we argue that OP-GFNs are less sensitive to different choices of λOP compared to the influence of different choices of β on GFN-β methods in Appendix E.4.5. Flow Matching. In the parametrization of the flow matching, we denote the order-preserving reward by b RFM(x; θ) = P (s x) A F(s , x; θ). The loss can be written as: LOP FM(TB; θ) = Eτi TB t=1 LFM(si t; θ) + λOP LOP(XB; b RFM( ; θ)), LFM(s; θ) = (s s) A F(s , s; θ) log X (s s ) A F(s, s ; θ) is the flow matching loss for each state s. Detailed Balance. In the parametrization of the detailed balance, we denote the order-preserving reward at terminal states by b RDB(x; θ) = F(x; θ), x X. The loss can be written as: LOP DB(TB; θ) = Eτi TB t=1 LDB(si t 1, si t; θ) + λOP LOP(XB; b RDB( ; θ)), where LDB(s, s ; θ) = (log F(s; θ)PF (s |s; θ) log F(s ; θ)PB(s|s ; θ))2 , is the detailed balance loss for each transition s s . Sub Trajectory Balance. Similar to the detailed balance, the order-preserving reward is also b Rsub TB(x; θ) = F(x; θ), x X. The loss can be written as LOP sub TB(TB; θ) = Eτi TB X 0 u 0, which proves it minimize LOP(r). Proposition 5. For {xi}n i=0 X, assume the objective function u(x) is known, and u(xi) u(xj), 0 i < j n. We define two subscript sets: I1 := {i : u(xi) < u(xi), 0 i n 1}, I2 := {i : u(xi) = u(xi+1), 0 i n 1}, We introduce two auxiliary states x 1 and xn+1 for boundary conditions, such that u(x 1) = , u(xn+1) = + .5 The order-preserving reward b R(x) [1/γ, 1] is defined by minimizing the order-preserving loss with neighboring states: arg min r,r(x) [1/γ,1] LOP N({xi}n i=0 {x 1, xn+1}; r) := arg min r,r(x) [1/γ,1] i= 1 LOP(xi 1, xi; r). Let m = |I1|, and define one auxiliary function: f1(α) := αm+2 1 4 α + 3 Since f1(α), 1 γ f1(γ 1 m+1 ) are both monotonically increasing from 0 to infinity for α, γ 1. There exists unique γ0, αγ > 1 such that f1(αγ) = γ, f1(γ 1 m+1 0 ) = γ0 For γ > γ0, we have b R(x0) = αγγ 1, b R(xi+1) = αγ b R(xi), i I1, b R(xi+1) = βγ b R(xi), i I2 βγ = αγ 1 Also, minimizing LOP N will drive γ + , and hence αγ + , βγ 1 . 5Note that for regular x X, we have u(x) [0, + ) Published as a conference paper at ICLR 2024 Proof of Proposition 5. We can expand the order-preserving loss by separating two cases where u(xi 1) < u(xi) or u(xi 1) = u(xi) for 0 i n + 1. We remark that u(x 1) < u(x0) and u(xn) < u(xn+1) by the definition of x 1 and xn+1. arg min r,r(x) [1/γ,1] LOP N({xi}n i=0 {x 1, xn+1}; r) := arg min r,r(x) [1/γ,1] i= 1 LOP(xi 1, xi; r) := arg min r,r(x) [1/γ,1] i I1 { 1,n} log r(xi+1) r(xi) + r(xi+1) 1 log r(xi) r(xi) + r(xi+1) + r(xi+1) r(xi) + r(xi+1) Let LOP(r) := LOP N({xi}n i=0 {x 1, xn+1}, r) for abbreviation. We denote r(xi) by ri, 1 i n + 1, and define αi 1 such that ri = αi 1ri 1, 0 i n + 1. Since the objective LOP(r) decreases as r 1 decreases or rn+1 increases, we have b R(x 1) = 1/γ, b R(xn+1) = 1. We consider the terms involving ri, 0 i n in the order-preserving loss. 
(1) If u(xi 1) = u(xi) = u(xi+1), the relevant term is log ri 1 ri 1 + ri + log ri ri 1 + ri + log ri+1 ri+1 + ri + log ri ri+1 + ri ri = 1 ri + ri+1 + 1 ri + ri 1 1 Setting LOP(r) ri = 0, we have (2) If u(xi 1) = u(xi) < u(xi+1), the relevant term is log ri 1 ri 1 + ri + log ri ri 1 + ri log ri+1 ri+1 + ri + . ri = 1 ri + ri+1 1 2ri + 1 ri + ri+1 . Setting LOP(r) ri = 0, we have αi = 2 1 + αi 1 (3) If u(xi 1) < u(xi) < u(xi+1), the relevant term is LOP(r) = log ri ri 1 + ri log ri+1 ri+1 + ri + . ri = 1 ri + ri+1 + 1 ri + ri+1 1 Setting LOP(r) ri = 0, we have (4) If u(xi 1) < u(xi) = u(xi+1), the relevant term is LOP(r) = log ri ri 1 + ri 1 log ri ri+1 + ri + log ri+1 ri+1 + ri Published as a conference paper at ICLR 2024 ri = 1 ri + ri+1 + 1 ri + ri+1 3 Setting LOP(r) ri = 0, we have αi = αi 1 1 From (1), (2), (3), (4), assuming u(xi 1) < u(xi) = = u(xj) < u(xj+1), then αj = 2 1 + αj 1 1 αj 1 1 = 2 1 + αi 1 αi 1 = 2 1 + αi 1 1 αi 1+3 1 αi 1 1 αi 1+3 1 = αi 1. Therefore, we can define α := αi if u(xi) < u(xi+1), and β := αi if u(xi) = u(xi+1), and β = α 1 α+3, α 1, 0 β 1. Then, log r is piecewise linear with two different slope, log α and log β. By the definition of x 1 and xn+1, we have r0 = αr 1 = αγ 1, rn = rn+1/α = b/α = 1/α. According to the previous definition, We have α|I1|+2β|I2| = γ, i.e. α is the solution to the following equation Therefore, α = αγ by the definition of αγ. We finally need to ensure b R( ) preserves the order, i.e. if u(xi) < u(xj), we have b R(xi) < b R(xj). We can bound: b R(xj) b R(xi) αγ and the equality holds iff I2 {i, i + 1, , j 1} and j i = |I2| + 1. We have by the definition of αγ, γ0 and monotonic increasing of f1(α) and f1(γ 1 m+1 )γ 1 w.r.t. α and γ, n m > 1 αm+1 γ > γ f1(γ 1 m+1 ) γ γ > γ0, which satisfies the assumption, hence b R(xj) > b R(xi). The order-preserving loss can be explicitly written as: LOP(r) = n m 2 log βγ (1 + βγ)2 (m + 2) log αγ 1 + αγ 2 log (αγ 1)(αγ + 3) 4(αγ + 1)2 (m + 2) log αγ 1 + αγ , and taking the derivative w.r.t. αγ, we have αγ = 4n m (α2γ 1)(αγ + 3) m + 2 αγ(αγ + 2) < 0. Therefore, minimizing the order-preserving loss corresponds to making αγ + , and therefore βγ 1 and γ + . Remark. If u(x0) < u(x1) or u(xn 1) < u(xn), the auxiliary states x 1 or xn+1 are not necessary to be added. In this case, we have u(x0) = 1/γ or u(x0) = 1 without auxiliary states, following a similar argument in Proposition 1. However, if there are multiple states of minimal or maximal objective value, we need to introduce u(x 1) = or u(xn+1) = + , so that r0 or rn appears in Published as a conference paper at ICLR 2024 two terms in LOP(r), unifying our analysis and avoiding boundary condition difference. For example, if u(x0) = u(x1) without x 1, we have r0 = 1 r0 + r1 1 2r0 + 1 r0 + r1 , r1 = 1 r0 + r1 1 2r1 + 1 r0 + r1 + LOP(x1, x2; r) r0 and LOP(r) r1 cannot be zero at the same time. Proposition 6. In the sequence prepend/append MDP in Definition 1, we consider a fixed dataset {xi, x n}n i=0 with u(x0) < u(x1) < < u(xn) = u(x n). Denote s as the important substring, defined as the longest substring shared by xn, x n with length k, and sk(x) as the set of k-length substrings of x. Following Proposition 5, let the order-preserving reward b R [1/γ, 1], and the ratio αγ. We fix PB to be uniform and match the flow F( ) with b R( ) on terminal states. Then, when αγ > 4, we have EF(s ) > EF(s), s sk(x)\s , where the expectation is taken over the random positions of s in xn, x n. Proof of Proposition 6. 
Proposition 6. In the sequence prepend/append MDP of Definition 1, consider a fixed dataset $\{x_i\}_{i=0}^{n}\cup\{x'_n\}$ with $u(x_0) < u(x_1) < \cdots < u(x_n) = u(x'_n)$. Denote by $s^\star$ the important substring, defined as the longest substring shared by $x_n$ and $x'_n$, with length $k$, and by $s_k(x)$ the set of $k$-length substrings of $x$. Following Proposition 5, let the order-preserving reward be $\hat R \in [1/\gamma, 1]$ with ratio $\alpha_\gamma$. We fix $P_B$ to be uniform and match the flow $F(\cdot)$ with $\hat R(\cdot)$ on terminal states. Then, when $\alpha_\gamma > 4$, we have $\mathbb{E}F(s^\star) > \mathbb{E}F(s)$ for all $s \in s_k(x_n)\cup s_k(x'_n)\setminus\{s^\star\}$, where the expectation is taken over the random positions of the substrings in $x_n, x'_n$.

Proof of Proposition 6. By Proposition 5, the rewards satisfy $\hat R(x_i) = \hat R(x_0)\alpha_\gamma^{i}$ for $0 \le i \le n$ and $\hat R(x'_n) = \hat R(x_0)\alpha_\gamma^{n}\beta_\gamma$. We claim that a uniform backward policy induces a uniform distribution over the trajectories connecting $x$ to $s_0$. When constructing a sequence $x$ of length $l$, we have $l-1$ binary choices of prepending or appending. Therefore, there are $2^{l-1}$ trajectories ending at $x$, and each trajectory has flow $\hat R(x)/2^{l-1}$.

Let a substring $s$ of length $k$ be preceded by $a$ characters in $x$ of length $l$; in total there are $\binom{l-k}{a}2^{k-1}$ trajectories passing through $s$ that end at $x$. The average number of such trajectories over a uniform distribution on $0 \le a \le l-k$ is $\frac{2^{l-1}}{l-k+1}$. Thus, the expected flow passing through $s$ and ending at $x$ alone, over uniformly random positions of $s$ in $x$, is $\frac{\hat R(x)}{l-k+1}$ for any $s, x$. We have $\mathbb{E}F(s^\star) \ge \frac{\hat R(x_n) + \hat R(x'_n)}{l-k+1}$, and for $s \in s_k(x_n)\setminus\{s^\star\}$, $\mathbb{E}F(s) \le \frac{\hat R(x_n) + \sum_{i=0}^{n-1}\hat R(x_i)}{l-k+1}$. As long as
$$1 + \alpha_\gamma + \cdots + \alpha_\gamma^{n-1} = \frac{\alpha_\gamma^{n}-1}{\alpha_\gamma-1} < \alpha_\gamma^{n}\beta_\gamma,$$
for which $\alpha_\gamma > 4$ is sufficient, we have $\mathbb{E}F(s^\star) > \mathbb{E}F(s)$ for all $s \in s_k(x_n)\cup s_k(x'_n)\setminus\{s^\star\}$. Therefore, the order-preserving reward can correctly assign credit to the high-reward intermediate state $s^\star$. Note that our result does not contradict those of Shen et al. (2023), since we consider $F(s)$ for individual $s \in s_k(x)\setminus\{s^\star\}$, instead of $F(s_k(x)\setminus\{s^\star\})$.

E SINGLE-OBJECTIVE EXPERIMENTS

In this section, our implementation is based on torchgfn (Lahlou et al., 2023) and the implementation of Shen et al. (2023).

E.1 PRELIMINARIES

We introduce some important tricks used to improve performance in the experiments.

Backward KL Regularization. A fixed uniform backward distribution $P_B$ has been shown to avoid the bias induced by joint training of $P_B$ and $P_F$, but it also suffers from slow convergence (Malkin et al., 2022). In this paper, we propose to regularize the backward distribution by its KL divergence w.r.t. the uniform distribution. For a trajectory $\tau = (s_0\rightarrow s_1\rightarrow\cdots\rightarrow s_n = x)$, define the KL-regularized trajectory loss $\mathcal{L}_{\mathrm{KL}}$ as
$$\mathcal{L}_{\mathrm{KL}}(\tau) := \frac{1}{n}\sum_{t=1}^{n}\mathcal{L}_{\mathrm{KL}}(s_t),\qquad \mathcal{L}_{\mathrm{KL}}(s_t) := \mathrm{KL}\big(P_B(\cdot\,|\,s_t;\theta)\,\|\,U_B(\cdot\,|\,s_t)\big),$$
where $U_B(\cdot\,|\,s_t)$ is the uniform distribution over valid backward actions. Such a KL regularizer can be plugged into any training objective that parametrizes the backward probability $P_B$, including the DB and (sub)TB objectives. In Appendix E.2.1, we show that the KL regularizer provides the advantages of both fixed and trainable $P_B$.

Backward Trajectories Augmentation. We adopt prioritized replay training (PRT) (Shen et al., 2023), which focuses on high-reward data. We form a replay batch from $X$, the set of all terminal states seen so far, so that $\alpha_1$ percent of the batch is sampled from the top $\alpha_2$ percent of the objective values, and the rest of the batch is sampled from the bottom $(100-\alpha_2)$ percent. We sample augmented trajectories from the replay batch using $P_B$. In practice, we select $\alpha_1 = 50$, $\alpha_2 = 10$.

$\hat R(x)$ as a Proxy. When evaluating the objective function $u(x)$ is costly, we propose to use $\hat R(x)$ as a proxy. If we want to sample $k$ terminal states with maximal rewards, we can first sample $K \ge k$ terminal states and pick the states with the top-$k$ values of $\hat R(x)$. Then, we need only evaluate $u(x)$ on $k$ instead of $K$ terminal states. We define the boosting ratio to be $r_{\mathrm{boost}} = K/k$. For GFlowNet objectives that parametrize $F(s)$, we can directly let $\hat R(x) = F(x)$, $x\in X$. For the TB objective, we use Equation (3) to approximate $\hat R(x)$. Since the cost of evaluating $\hat R(x)$ is also non-negligible, we only adopt this strategy when obtaining $u(x)$ directly is significantly more difficult. For example, in the neural architecture search environment (Section 4.3), evaluating $u_T(x)$ requires training a network to step $T$ to get the test accuracy.
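A minimal sketch of this filtering step follows, assuming the learned terminal flow is exposed as a callable log-reward; the helper names (`propose_topk`, `r_boost`) are ours and not from the released code.

```python
import torch

def propose_topk(sampler, log_R_hat, evaluate_u, k, r_boost=4):
    """Sample K = r_boost * k candidates, keep the top-k under the learned
    proxy reward R_hat (e.g. the terminal state flow F(x; theta)), and only
    evaluate the expensive objective u on those k candidates."""
    K = r_boost * k
    candidates = sampler(K)                               # K terminal states
    scores = torch.stack([log_R_hat(x) for x in candidates])
    top = torch.topk(scores, k).indices
    shortlist = [candidates[i] for i in top]
    return [(x, evaluate_u(x)) for x in shortlist]        # k expensive evaluations
```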
Training Procedure. We adopt a hybrid of online and offline training of the GFlowNet. The full pseudo-algorithm is summarized in Algorithm 1.

Algorithm 1 Order-Preserving GFlowNet
$N_{\mathrm{init}}$: number of forward-sampled terminal states at initialization. $N_{\mathrm{round}}$: number of rounds of forward sampling. $N_{\mathrm{new}}$: number of forward-sampled terminal states in every round. $N_{\mathrm{off}}$: number of backward-augmented terminal states. $N_{\mathrm{off/per}}$: number of backward-augmented trajectories per terminal state.
Initialize: $T_0$: randomly initialized trajectory set of size $N_{\mathrm{init}}$; $\theta_0$: GFlowNet parameters; $\mathcal{L}(\theta; T)$: order-preserving loss defined in Equation (2).
for $i = 1, \ldots, N_{\mathrm{round}}$ do
  Update parameters $\theta'_i \leftarrow \theta_{i-1}$ on the trajectory set $T_{i-1}$.
  Sample $N_{\mathrm{new}}$ terminal trajectories $T'_i$ with the forward action sampler parametrized by $\theta'_i$.
  Update the trajectory set $T_i \leftarrow T_{i-1}\cup T'_i$.
  Sample $N_{\mathrm{off}}$ terminal states from $T_i$, and augment them into trajectories $T''_i$ with the uniform backward sampler, with $N_{\mathrm{off/per}}$ trajectories per terminal state.
  Update parameters $\theta_i \leftarrow \theta'_i$ on the trajectory set $T''_i$.
end for

E.2 HYPERGRID

Objective Function. The objective function at the state $x = (x_1, \ldots, x_D)$ is given by
$$u(x) = R_0 + 0.5\prod_{d=1}^{D}\mathbb{I}\big[\,|x_d - 0.5| \in (0.25, 0.5]\,\big] + 2\prod_{d=1}^{D}\mathbb{I}\big[\,|x_d - 0.5| \in (0.3, 0.4)\,\big],\qquad (7)$$
where $\mathbb{I}$ is the indicator function and $R_0$ is a constant controlling the difficulty of exploration. This objective function has peaks of height $2.5 + R_0$ near the four corners of the HyperGrid, surrounded by plateaux of height $0.5 + R_0$. These plateaux are situated on a wide valley of height $R_0$.

Network Structure. We use a shared encoder to parametrize the state-flow estimator $F(s)$ and the transition-probability estimators $P_F(\cdot|s)$, $P_B(\cdot|s)$, and one tensor to parametrize the normalizing constant $Z$. The encoder is an MLP with 2 hidden layers and 256 hidden dimensions. We use ReLU as the activation function. We use the Adam optimizer with a learning rate of 0.1 for $Z_\theta$'s parameters and a learning rate of 0.001 for the neural network's parameters.

E.2.1 BACKWARD KL REGULARIZATION

In this subsection, we validate the effectiveness of the backward KL regularization of Appendix E.1. We set the reward $R(x) = u(x)$ defined in Equation (7). Following the definition of the HyperGrid in Section 4.1, we consider two grids with the same number of terminal states: a 2-dimensional grid with $H = 64$ and a 4-dimensional grid with $H = 8$. We set $R_0 = 0.1, 0.01, 0.001$, where $R_0$ is defined in Equation (7). We remark that a larger $H$ implies longer trajectories, and a smaller $R_0$ poses greater exploration challenges since models are less likely to pass the low-reward valley.

We analyze the training behavior of fixed and regularized $P_B$. During training, we update the model on 1000 actively sampled terminal states in each round, for 1000 rounds in total. We expect that a KL-regularized $P_B$ converges faster than a fixed $P_B$ on hard instances, such as long trajectory lengths and small $R_0$, and avoids the bias introduced by a freely trainable $P_B$ without regularization. To validate this, we plot the progression of the $\ell_1$ error between the target reward distribution and the observed distribution of the most recent $10^5$ visited states, and the empirical KL divergence of the learned $P_B$ during training, in Figures E.1 and E.2.
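For concreteness, the following sketch shows one way to compute the regularizer $\mathrm{KL}(P_B(\cdot|s)\,\|\,U_B(\cdot|s))$ from masked backward-policy logits and add it to a base objective; it is an illustration under our own naming, not the released code.

```python
import torch
import torch.nn.functional as F

def backward_kl_regularizer(logits_PB, valid_mask):
    """KL(P_B(.|s) || U_B(.|s)) averaged over a batch of states.

    logits_PB:  (B, A) unnormalized backward-policy logits.
    valid_mask: (B, A) boolean mask of valid backward actions at each state.
    """
    logits = logits_PB.masked_fill(~valid_mask, -1e9)       # mask invalid actions
    log_pb = F.log_softmax(logits, dim=-1)                   # log P_B(.|s)
    n_valid = valid_mask.sum(dim=-1, keepdim=True).clamp(min=1).float()
    log_ub = -torch.log(n_valid)                             # log of uniform prob
    kl = torch.where(valid_mask, log_pb.exp() * (log_pb - log_ub),
                     torch.zeros_like(log_pb)).sum(dim=-1)
    return kl.mean()

# total loss, e.g. with a DB or (sub)TB objective `base_loss`:
# loss = base_loss + lambda_kl * backward_kl_regularizer(logits_PB, valid_mask)
```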
We observe that, with a fixed $P_B$, convergence in $\ell_1$ distance is very slow on hard instances (i.e., $R_0 \le 0.01$ and $H = 64$), whereas a regularized $P_B$ with a proper $\lambda_{\mathrm{KL}}$, e.g., 0.1 or 1, converges faster and keeps a near-uniform backward transition probability during training. We also observe that on short trajectories, fixed and trainable $P_B$ show no difference in convergence speed. To illustrate the bias of a trainable $P_B$ without regularization, we visualize the learned sampler after $10^6$ visited states by plotting the probability of each state performing the terminal action in Figure E.3. The initial state is $(0, 0)$, and the reward is symmetric with respect to the diagonal from $(0, 0)$ to $(63, 63)$. Therefore, the learned probability of the terminal action should be symmetric with respect to the diagonal. We observe that standard TB training is biased towards the upper-diagonal part, while fixed-$P_B$ TB and KL-regularized TB behave more symmetrically.

Figure E.1: HyperGrid with $D = 2$, $H = 64$, and $R_0 = 0.1, 0.01, 0.001$. $P_B$ is trainable with KL regularization weight $\lambda_{\mathrm{KL}} = 0, 0.01, 0.1, 1, 10$, or $P_B$ is fixed. (Panels show the empirical $\ell_1$ error and the KL divergence against the number of visited states.)

E.2.2 ABLATION STUDY OF R0

In this subsection, we provide an ablation study over $R_0 = \{0, 0.0001, 0.001, 0.01, 0.1, 1\}$ for Section 4.1. We set $(D, H) = (2, 64), (3, 32)$ and $(4, 16)$, and compare TB and order-preserving TB (OP-TB). For $(D, H) = (2, 64)$, we plot the observed distribution over the 4000 most recently visited states in Figure E.4. We additionally plot the following three ratios in Figure E.5: 1) #(distinctly visited states)/#(all states); 2) #(distinctly visited maximal states)/#(all maximal states); 3) among the 4000 most recently visited states, #(distinct maximal states)/4000. We train the network for 500 steps, 200 trajectories per step, and 20 steps per checkpoint.

Figure E.2: HyperGrid with $D = 4$, $H = 8$, and $R_0 = 0.1, 0.01, 0.001$. $P_B$ is trainable with KL regularization weight $\lambda_{\mathrm{KL}} = 0, 0.01, 0.1, 1, 10$, or $P_B$ is fixed. (Panels show the empirical $\ell_1$ error and the KL divergence against the number of visited states.)

Figure E.3: HyperGrid with $D = 2$, $H = 64$, $R_0 = 0.01$. We set $\lambda_{\mathrm{KL}} = 1$ for KL regularization. We plot the probability of each state performing the terminal action. (Panels: trainable $P_B$ and KL-regularized $P_B$.)

We remark that $R_0$ plays a similar role to the reward exponent $\beta$ in flattening or sparsifying the rewards. A large $R_0$ facilitates exploration but hinders exploitation, since even a perfectly trained GFlowNet will sample non-maximal-objective candidates with high probability; a low $R_0$ facilitates exploitation but hinders exploration, since low-reward valleys hinder the discovery of maximal-objective candidates far from the initial state. A good sampling algorithm should have a small ratio 1) and large ratios 2), 3), which means it can sample diverse maximal states (large ratio 2), exploration), and sample mostly maximal states (large ratio 3), exploitation), using the fewest distinct visited states (small ratio 1), efficiency). We observe from Figure E.5 that OP-TB outperforms TB for almost every $R_0$ in terms of all three ratios.

Figure E.4: Full experiments on $R_0 = 1, 0.1, 0.01, 0.001, 0.0001, 0$. We plot the observed distribution over the 4000 most recently visited states, when we have sampled 4000, 8000, 12000, 16000, and 20000 states (one panel per checkpoint).
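The three ratios reported in Figure E.5 can be computed from the stream of visited terminal states as in the sketch below; the helper names are ours.

```python
def diversity_ratios(visited, all_states, maximal_states, window=4000):
    """The three ratios plotted in Figure E.5 (illustrative helper).

    visited:        list of terminal states in the order they were sampled.
    all_states:     set of all terminal states.
    maximal_states: set of terminal states with maximal objective value.
    """
    distinct = set(visited)
    recent = visited[-window:]
    r1 = len(distinct) / len(all_states)                        # efficiency
    r2 = len(distinct & maximal_states) / len(maximal_states)   # exploration
    r3 = len(set(recent) & maximal_states) / len(recent)        # exploitation
    return r1, r2, r3
```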
Figure E.5: Full experimental results on the HyperGrid with $(D, H) = (2, 64), (3, 32), (4, 16)$ and $R_0 = 0, 0.0001, 0.001, 0.01, 0.1, 1$. We plot, against the number of visited states, the ratios 1) #(distinctly visited states)/#(all states); 2) #(distinctly visited maximal states)/#(all maximal states); 3) among the 4000 most recently visited states, #(distinct maximal states)/4000, for the OP-TB (solid lines) and TB (dashed lines) methods.

E.2.3 $\hat R(\cdot)$ DISTRIBUTION

We consider the cosine objective function. The objective function at the state $x = (x_1, \ldots, x_D)$ is given by
$$u(x) = R_0 + \prod_{d=1}^{D}\big(\cos(50 x_d) + 1\big)\big(\phi(0) - \phi(5 x_d)\big),$$
where $\phi$ is the standard normal p.d.f. Such a choice offers a more complex landscape than those in Section 4.1. We set $D = 2$, $H = 32$, $R_0 = 1$ in the HyperGrid. Since DB directly parametrizes $F(s)$, we adopt the OP-DB method with $\lambda_{\mathrm{OP}} = 0.1$. Assuming there are $n$ distinct values of $u$ among all terminal states, $u_1 < u_2 < \cdots < u_n$, we calculate the mean of the learned $\hat R(\cdot)$ over all states with objective value $u_i$, i.e.,
$$\hat R_i(\theta) := \mathrm{Avg}\{\hat R(x) := F(x;\theta) : u(x) = u_i\},\qquad 1 \le i \le n.$$
We sample 20 states per round and checkpoint the GFlowNet every 50 rounds. In Figure E.6, we plot the log of the normalized $R_i$ and $\hat R_i$, obtained by linearly scaling their sums over $i$ to 1. We observe that $\log\hat R_i$ is approximately linear in $i$, confirming Propositions 1 and 2, and that the slope increases as training proceeds.

We note that the active training process used in practice puts more mass on states of near-maximal objective value than predicted by Propositions 1 and 2, which assume full-batch training. In other words, the slope is larger near the maximal values, for the following reason: since the GFlowNet is trained actively, high-value states are more likely to be collected in the later training stages. Therefore, the order-preserving method in active training pays more attention to high-value states as training proceeds, resulting in a larger slope in the near-maximal objective-value states than the log-linear prediction of Propositions 1 and 2.

Figure E.6: Log-normalized $R(\cdot)$ (red) and $\hat R(\cdot)$, plotted against the rank of $x$ by $R(x)$. We checkpoint the GFlowNet sampler at training rounds 0 (randomly initialized), 50, 100, 200, 400.

E.3 MOLECULAR DESIGN

E.3.1 ENVIRONMENTS

In the following, we describe each environment and its training setup individually. We adopt the sequence prepend/append MDPs proposed by Shen et al. (2023) to simplify the graph sampling problems. For instance, in this section, we employ the sEH reward with 18 fragments and approximately $10^7$ candidates. It is worth noting that the original sEH environment (Bengio et al., 2021a) is a fragment-graph-based MDP with around 100 fragments and more than $10^{16}$ candidates. We will utilize the sEH reward with this fragment-graph-based MDP for the multi-objective optimization problems in Section 5.4.
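As a schematic of the sequence prepend/append construction used here and in Proposition 6, the toy class below exposes prepend and append actions over a small vocabulary, so a length-$l$ sequence is reachable by $2^{l-1}$ trajectories; it is an illustration only, not the environment code of Shen et al. (2023).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrependAppendState:
    """State of a toy sequence prepend/append MDP over a fixed vocabulary."""
    tokens: tuple = ()
    max_len: int = 8
    vocab: tuple = ("A", "C", "G", "T")

    def actions(self):
        # a full-length state only has the terminal action; otherwise each action
        # either prepends or appends one token (from the empty sequence the two
        # choices coincide)
        if len(self.tokens) >= self.max_len:
            return [("terminate", None)]
        return [(where, t) for where in ("prepend", "append") for t in self.vocab]

    def step(self, action):
        where, t = action
        if where == "prepend":
            return PrependAppendState((t,) + self.tokens, self.max_len, self.vocab)
        if where == "append":
            return PrependAppendState(self.tokens + (t,), self.max_len, self.vocab)
        return self  # terminal action leaves the sequence unchanged
```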
Bag ($|X| = 96{,}889{,}010{,}407$). A multiset of size 13 with 7 distinct elements. The base objective value is 0.01. A bag has a "substructure" if it contains 7 or more repeats of any letter: then it has objective value 10 with 75% chance and 30 otherwise. We define the maximal-objective candidates as the $x\in X$ with objective value 30; their total number is 3160. The policy encoder is an MLP with 2 hidden layers and 16 hidden dimensions. We use an exploration epsilon $\varepsilon_F = 0.10$. The number of active training rounds is 2000.

SIX6 (TFBind8) ($|X| = 65{,}536$). A string of 8 nucleotides. The objective function is wet-lab measured DNA binding activity to a human transcription factor, SIX6_REF_R1, from Barrera et al. (2016); Fedus et al. (2020). We define the optimal candidates as the top 0.5% of $X$ ranked by the objective function; their number is 328. The policy encoder is an MLP with 2 hidden layers and 128 hidden dimensions. We use an exploration epsilon $\varepsilon_F = 0.01$ and a reward exponent $\beta = 3$. The number of active training rounds is 2000.

PHO4 (TFBind10) ($|X| = 1{,}048{,}576$). A string of 10 nucleotides. The objective function is wet-lab measured DNA binding activity to the yeast transcription factor PHO4, from Barrera et al. (2016); Fedus et al. (2020). We define the optimal candidates as the top 0.5% of $X$ ranked by the objective function; their total number is 5764. The policy encoder is an MLP with 3 hidden layers and 512 hidden dimensions. We use an exploration epsilon $\varepsilon_F = 0.01$ and a reward exponent $\beta = 3$, and scale the reward to a maximum of 10. The number of active training rounds is 20000.

QM9 ($|X| = 161{,}051$). A small molecule graph (Blum & Reymond, 2009; Montavon et al., 2013). The objective function is from a proxy deep neural network predicting the HOMO-LUMO gap from density functional theory. We build with 12 building blocks with 2 stems and generate 5 blocks per molecule. We define the optimal candidates as the top 0.5% of $X$ ranked by the objective function; their total number is 805. The policy encoder is an MLP with 2 hidden layers and 1024 hidden dimensions. We use an exploration epsilon $\varepsilon_F = 0.10$ and a reward exponent $\beta = 5$, and scale the reward to a maximum of 100. The number of active training rounds is 2000.

sEH ($|X| = 34{,}012{,}224$). A small molecule graph. The objective function is from a proxy model trained to predict binding affinity to soluble epoxide hydrolase. Specifically, we use a gradient-boosted regressor on the graph-neural-network predictions of the proxy model provided by Bengio et al. (2021a) and Kim et al. (2023b), to memorize the model. Their proxy model achieved an MSE of 0.375 and a Pearson correlation of 0.905. We build with 18 building blocks with 2 stems and generate 6 blocks per molecule. We define the optimal candidates as the top 0.1% of $X$ ranked by the objective function; their total number is 34013. The policy encoder is an MLP with 2 hidden layers and 1024 hidden dimensions. We use an exploration epsilon $\varepsilon_F = 0.05$ and a reward exponent $\beta = 6$, and scale the reward to a maximum of 10. The number of active training rounds is 2000.

E.3.2 EXPERIMENTAL DETAILS

Our implementation is adapted from the official implementations of Shen et al. (2023); Kim et al. (2023b).6 We follow their environment and training settings in the code.
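For reference, the per-environment settings listed in Appendix E.3.1 can be gathered into a single configuration table; the dictionary below is our own summary of those numbers and is not part of the released code.

```python
# Our own summary of the settings listed in Appendix E.3.1 (hypothetical helper).
ENV_CONFIGS = {
    #            MLP layers / hidden dim, exploration eps_F, reward exponent beta,
    #            reward scale maximum, number of active training rounds
    "bag":      dict(layers=2, hidden=16,   eps_f=0.10, beta=None, r_max=None, rounds=2000),
    "tfbind8":  dict(layers=2, hidden=128,  eps_f=0.01, beta=3,    r_max=None, rounds=2000),
    "tfbind10": dict(layers=3, hidden=512,  eps_f=0.01, beta=3,    r_max=10,   rounds=20000),
    "qm9":      dict(layers=2, hidden=1024, eps_f=0.10, beta=5,    r_max=100,  rounds=2000),
    "seh":      dict(layers=2, hidden=1024, eps_f=0.05, beta=6,    r_max=10,   rounds=2000),
}
```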
Network Structure. When training the GFNs and OP-GFNs, instead of parametrizing the forward and backward transitions by a policy parametrization, we first parametrize the edge flow $F(s\rightarrow s';\theta)$ and define the forward transition probability as $F(s\rightarrow s';\theta)/\sum_{s'':(s\rightarrow s'')\in A}F(s\rightarrow s'';\theta)$. We clip the gradient norm to a maximum of 10.0, and the policy logits to a maximum absolute value of 50.0. We initialize $\log Z_\theta$ to 5.0, which is smaller than the ground-truth $\log Z$ in every environment. We use the Adam optimizer with a learning rate of 0.01 for $Z_\theta$'s parameters and a learning rate of 0.0001 for the neural network's parameters. We do not share the policy parameters between the forward and backward transition functions. When training the RL-based methods, we set the initial learning rate of the actor and critic networks to 0.0001. We set A2C's entropy-regularization parameter to 0.01 and SQL's temperature parameter to 0.01.

Training and Evaluation. In each round, we first update the model using an on-policy batch of size 32, and then perform an additional update on one off-policy batch of size 32 generated by the backward-trajectory augmentation (PRT) from the replay buffer. The number of active training rounds varies across tasks; see Appendix E.3.1 for details. To monitor the training process, every 10 active rounds we sample 128 monitoring candidates from the current training policy without random noise injection. At each monitoring point, we record the average value of the top-100 candidates ranked by the objective function, and the number of optimal candidates found among all generated candidates.

E.4 NEURAL ARCHITECTURE SEARCH

E.4.1 NAS ENVIRONMENT

In this subsection, we study the neural architecture search environment NATS-Bench (Dong et al., 2021), which includes three datasets: CIFAR-10, CIFAR-100, and ImageNet-16-120. We choose the topology search space in NATS-Bench, i.e., the densely connected DAG of 4 nodes and the operation set of 5 representative candidates. The representative operations $O = \{o_k\}_{k=1}^{5}$ are 1) zero, 2) skip connection, 3) 1-by-1 convolution, 4) 3-by-3 convolution, and 5) 3-by-3 average pooling. Each architecture can be uniquely determined by a sequence x = {o^{(i,j)} ∈ O}, 1 ≤ i