Multi-Objective GFlowNets

Moksh Jain 1 2, Sharath Chandra Raparthy 1 2 *, Alex Hernandez-Garcia 1 2, Jarrid Rector-Brooks 1 2, Yoshua Bengio 1 2 3, Santiago Miret 4, Emmanuel Bengio 5

*Work done during an internship at Recursion. 1Université de Montréal 2Mila - Quebec AI Institute 3CIFAR Fellow & IVADO 4Intel Labs 5Recursion. Correspondence to: Moksh Jain.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study the problem of generating diverse candidates in the context of multi-objective optimization. In many applications of machine learning, such as drug discovery and material design, the goal is to generate candidates which simultaneously optimize a set of potentially conflicting objectives. Moreover, these objectives are often imperfect evaluations of some underlying property of interest, making it important to generate diverse candidates so that multiple options are available for expensive downstream evaluations. We propose Multi-Objective GFlowNets (MOGFNs), a novel method for generating diverse Pareto-optimal solutions, based on GFlowNets. We introduce two variants of MOGFNs: MOGFN-PC, which models a family of independent sub-problems defined by a scalarization function with reward-conditional GFlowNets, and MOGFN-AL, which solves a sequence of sub-problems defined by an acquisition function in an active learning loop. Our experiments on a wide variety of synthetic and benchmark tasks demonstrate the advantages of the proposed methods in terms of Pareto performance and, importantly, improved candidate diversity, which is the main contribution of this work.

1. Introduction

Decision making in practical applications usually involves reasoning about multiple, often conflicting, objectives (Keeney et al., 1993). Consider the example of in-silico drug discovery, where the goal is to generate novel drug-like molecules that effectively inhibit a target, are easy to synthesize, and possess a safety profile for human use (Dara et al., 2021). These objectives often exhibit mutual incompatibility, as molecules that are effective against a target may also have detrimental effects on humans, making it infeasible to find a single molecule that maximizes all the objectives simultaneously. Instead, the goal in these Multi-Objective Optimization (MOO; Ehrgott, 2005; Miettinen, 2012) problems is to identify candidate molecules that are Pareto optimal, covering the best possible trade-offs between the objectives.

A less appreciated aspect of multi-objective problems is that the objectives to optimize are usually underspecified proxies which only approximate the true design objectives. For instance, the binding affinity of a molecule to a target is an imperfect approximation of the molecule's inhibitory effect against the target in the human body. In such scenarios it is important not only to cover the Pareto front but also to generate sets of diverse candidates for each Pareto-optimal solution, in order to increase the likelihood of success of the generated candidates in expensive downstream evaluations, such as in-vivo tests and clinical trials (Jain et al., 2022). The benefits of generating diverse candidates are twofold. First, by diversifying the set of candidates we obtain an advantage similar to Bayesian ensembles: we reduce the risk of failure that might occur due to the imperfect generalization of learned proxy models.
Diverse candidates should lie in different regions of the input-space manifold where the objective of interest might be large (considering the uncertainty in the output of the proxy model). Second, experimental measurements such as in-vitro assays may not reflect the ultimate objectives of interest, such as efficacy in human bodies. Multiple candidates may have the same assay score but different downstream efficacy, so diversity among candidates increases the odds of success. Existing approaches for MOO overlook this aspect of diversity and instead focus primarily on generating Pareto-optimal solutions.

Generative Flow Networks (GFlowNets; Bengio et al., 2021a;b) are a recently proposed family of probabilistic models which tackle the problem of diverse candidate generation. Contrary to the reward-maximization view of prevalent Reinforcement Learning (RL) and Bayesian Optimization (BO) approaches, GFlowNets sample candidates with probability proportional to their reward. Sampling candidates, as opposed to greedily generating them, implicitly encourages diversity in the generated candidates. GFlowNets have shown promising results in single-objective problems such as molecule generation (Bengio et al., 2021a) and biological sequence design (Jain et al., 2022).

In this paper, we propose Multi-Objective GFlowNets (MOGFNs), which leverage the strengths of GFlowNets and existing MOO approaches to enable the generation of diverse Pareto-optimal candidates. We consider two variants of MOGFNs: (a) Preference-Conditional GFlowNets (MOGFN-PC), which leverage the decomposition of MOO into single-objective sub-problems through scalarization, and (b) MOGFN-AL, which leverages the transformation of MOO into a sequence of single-objective sub-problems within the framework of multi-objective Bayesian optimization. Our contributions are as follows:

C1 We introduce a novel framework of MOGFNs to tackle the practically significant and previously unexplored problem of diverse candidate generation in MOO.

C2 Through experiments on challenging molecule generation and sequence generation tasks, we demonstrate that MOGFN-PC generates diverse Pareto-optimal candidates. This is the first successful application and empirical validation of reward-conditional GFlowNets (Bengio et al., 2021b).

C3 In a challenging active learning task for designing fluorescent proteins, we show that MOGFN-AL results in significant improvements to sample efficiency as well as diversity of generated candidates.

C4 We perform a thorough analysis of the key components of MOGFNs to provide insights into design choices that affect performance.

2. Background

2.1. Multi-Objective Optimization

Multi-Objective Optimization (MOO) involves finding a set of feasible candidates $x \in \mathcal{X}$ which simultaneously maximize $d$ objectives $R(x) = [R_1(x), \dots, R_d(x)]$:

$$\max_{x \in \mathcal{X}} R(x). \qquad (1)$$

When these objectives are conflicting, there is no single $x$ which simultaneously maximizes all objectives. Consequently, MOO adopts the concept of Pareto optimality, which describes a set of solutions that provide optimal trade-offs among the objectives. Given $x_1, x_2 \in \mathcal{X}$, $x_1$ is said to dominate $x_2$, written $x_1 \succ x_2$, iff $R_i(x_1) \geq R_i(x_2)\ \forall i \in \{1, \dots, d\}$ and $\exists k \in \{1, \dots, d\}$ such that $R_k(x_1) > R_k(x_2)$. A candidate $x^*$ is Pareto-optimal if there exists no other solution $x' \in \mathcal{X}$ which dominates $x^*$. In other words, for a Pareto-optimal candidate it is impossible to improve one objective without sacrificing another. The Pareto set is the set of all Pareto-optimal candidates in $\mathcal{X}$, and the Pareto front is defined as the image of the Pareto set in objective space.
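To make the dominance relation concrete, here is a minimal sketch (ours, not from the paper) of a dominance check and a naive non-dominated filter over a matrix of objective values:

```python
import numpy as np

def dominates(r1, r2):
    """True if r1 dominates r2: no objective is worse and at least one is strictly better."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return bool(np.all(r1 >= r2) and np.any(r1 > r2))

def pareto_set_indices(R):
    """Indices of the non-dominated rows of an (n, d) matrix of objective values.
    Naive O(n^2) filter; fine for small candidate pools."""
    return [i for i in range(len(R))
            if not any(dominates(R[j], R[i]) for j in range(len(R)) if j != i)]

# Example: with R = [[0.9, 0.1], [0.5, 0.5], [0.4, 0.4]], only the last point is dominated.
```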
Diversity: Since the objectives are not guaranteed to be injective, any point on the Pareto front can be the image of several candidates in the Pareto set. This gives rise to diversity in candidate space. Capturing all the diverse candidates corresponding to a point on the Pareto front is critical for applications such as in-silico drug discovery, where the objectives $R_i$ (e.g., binding affinity to a target protein) are mere proxies for more expensive downstream measurements (e.g., effectiveness in clinical trials on humans). This notion of diversity of candidates is typically not captured by existing approaches for MOO.

2.1.1. APPROACHES FOR TACKLING MOO

While there exist many approaches for tackling MOO problems (Ehrgott, 2005; Miettinen, 2012; Pardalos et al., 2017), in this work we consider two distinct approaches that decompose the MOO problem into a family of single-objective sub-problems. These approaches, described below, are well suited for the GFlowNet formulations we introduce in Section 3.

Scalarization: In scalarization, a set of weights (preferences) $\omega_i$ is assigned to the objectives $R_i$, with $\omega_i \geq 0$ and $\sum_{i=1}^{d} \omega_i = 1$. The MOO problem can then be decomposed into single-objective sub-problems of the form $\max_{x \in \mathcal{X}} R(x|\omega)$, where $R(x|\omega)$ is called a scalarization function, which combines the $d$ objectives into a scalar. Solutions to these sub-problems capture all Pareto-optimal solutions of the original MOO problem, depending on the choice of $R(x|\omega)$ and the characteristics of the Pareto front. Weighted Sum scalarization, $R(x|\omega) = \sum_{i=1}^{d} \omega_i R_i(x)$, for instance, captures all Pareto-optimal candidates for problems with a convex Pareto front (Ehrgott, 2005). On the other hand, Weighted Tchebycheff, $R(x|\omega) = \min_{1 \leq i \leq d} \omega_i |R_i(x) - z_i^*|$, where $z_i^*$ denotes an ideal value for objective $R_i$, captures all Pareto-optimal solutions even for problems with a non-convex Pareto front (Choo & Atkins, 1983; Pardalos et al., 2017). As such, scalarization transforms the multi-objective optimization into a family of independent single-objective sub-problems.
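As an illustration, a small sketch (ours) of the two scalarization functions above; weighted_tchebycheff follows the definition given in the text, with z_star an ideal point supplied by the user:

```python
import numpy as np

def weighted_sum(R, w):
    """Weighted Sum: R(x|w) = sum_i w_i * R_i(x)."""
    return float(np.dot(w, R))

def weighted_tchebycheff(R, w, z_star):
    """Weighted Tchebycheff as written above: R(x|w) = min_i w_i * |R_i(x) - z*_i|."""
    R, w, z_star = map(np.asarray, (R, w, z_star))
    return float(np.min(w * np.abs(R - z_star)))

# Example: weighted_sum([0.8, 0.2], [0.5, 0.5]) == 0.5
```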
Multi-Objective Bayesian Optimization: In many applications, the objectives $R_i$ can be expensive to evaluate, making sample efficiency essential. Multi-Objective Bayesian Optimization (MOBO; Shah & Ghahramani, 2016; Garnett, 2022) builds upon BO to tackle these scenarios. MOBO relies on a probabilistic model $\hat{f}$ which approximates the objectives $R$ (oracles); $\hat{f}$ is typically a multi-task Gaussian Process (Shah & Ghahramani, 2016). Notably, as the model is Bayesian, it captures the epistemic uncertainty in the predictions due to the limited data available for training, which can be used as a signal for prioritizing potentially useful candidates. The optimization is performed over $M$ rounds, where each round $i$ consists of fitting the surrogate model $\hat{f}$ on the data $\mathcal{D}_i$ accumulated from previous rounds, and using this model to generate a batch of $b$ candidates $\{x_1, \dots, x_b\}$ to be evaluated with the oracles $R$, resulting in $\mathcal{B}_i = \{(x_1, y_1), \dots, (x_b, y_b)\}$. The evaluated batch $\mathcal{B}_i$ is then incorporated into the data for the next round, $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \mathcal{B}_i$. The batch of candidates in each round is generated by maximizing an acquisition function $a$ which combines the predictions from the surrogate model along with its epistemic uncertainty into a single scalar utility score. The acquisition function quantifies the utility of a candidate given the candidates evaluated so far. Effectively, MOBO decomposes the MOO into a sequence of single-objective optimization problems of the form $\max_{\{x_1, \dots, x_b\} \subset \mathcal{X}} a(\{x_1, \dots, x_b\}; \hat{f})$.

2.2. GFlowNets

GFlowNets (Bengio et al., 2021a;b) are a family of probabilistic models that learn a stochastic policy to generate compositional objects $x \in \mathcal{X}$, such as a graph describing a candidate molecule, through a sequence of steps, with probability proportional to their reward $R(x)$. If $R : \mathcal{X} \to \mathbb{R}^+$ has multiple modes, then i.i.d. samples from $\pi(x) \propto R(x)$ give good coverage of the modes of $R$, resulting in a diverse set of candidate solutions. The sequential construction of $x \in \mathcal{X}$ can be described as a trajectory $\tau \in \mathcal{T}$ in a weighted directed acyclic graph (DAG) $G = (\mathcal{S}, \mathcal{E})$ (if the object is constructed in a canonical order, say a string constructed from left to right, $G$ is a tree), starting from an empty object $s_0$ and following actions $a \in \mathcal{A}$ as building blocks. For example, a molecular graph may be sequentially constructed by adding and connecting new nodes or edges to the graph. Let $s \in \mathcal{S}$, or state, denote a partially constructed object. A transition $s \xrightarrow{a} s' \in \mathcal{E}$ indicates that action $a$ at state $s$ leads to state $s'$. Sequences of such transitions form constructive trajectories. The GFlowNet forward policy $P_F(\cdot|s)$ is a distribution over the children of state $s$. An object $x$ can be generated by starting at $s_0$ and sampling a sequence of actions iteratively from $P_F$. Similarly, the backward policy $P_B(\cdot|s')$ is a distribution over the parents of state $s'$ and can generate backward trajectories starting at any terminal $x$; e.g., iteratively sampling from $P_B$ starting at $x$ shows a way $x$ could have been constructed. Let $\pi(x)$ be the marginal likelihood of sampling trajectories terminating in $x$ following $P_F$, and let the partition function be $Z = \sum_{x \in \mathcal{X}} R(x)$. The learning problem solved by GFlowNets is to learn a forward policy $P_F$ such that the marginal likelihood $\pi(x)$ of sampling any object is proportional to its reward $R(x)$. In this paper we adopt the trajectory balance (TB; Malkin et al., 2022) learning objective. The trajectory balance objective learns $P_F(\cdot|s; \theta)$, $P_B(\cdot|s'; \theta)$ and $Z_\theta$, parameterized by $\theta$, which approximate the forward and backward policies and the partition function such that $\pi(x) \approx \frac{R(x)}{Z},\ \forall x \in \mathcal{X}$. We refer the reader to Bengio et al. (2021b); Malkin et al. (2022) for a more thorough introduction to GFlowNets.
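To ground the notation, a toy sketch (ours) of the constructive process for a string-building task, where the tree structure makes PB trivial; the uniform forward_policy is a stand-in for the learned network:

```python
import math
import random

VOCAB = ["A", "C", "G", "T"]   # example building blocks (actions)
EOS = "<eos>"                  # terminating action
MAX_LEN = 8

def forward_policy(state):
    """Placeholder for PF(.|s): a distribution over the children of state s."""
    actions = VOCAB + ([EOS] if state else [])  # force at least one token before stopping
    return {a: 1.0 / len(actions) for a in actions}

def sample_trajectory():
    """Roll out PF from the empty object s0; return the terminal x and log PF(tau)."""
    state, log_pf = "", 0.0
    while len(state) < MAX_LEN:
        probs = forward_policy(state)
        action = random.choices(list(probs), weights=list(probs.values()))[0]
        log_pf += math.log(probs[action])
        if action == EOS:
            break
        state += action  # transition s -> s' appends one token
    return state, log_pf
```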
3. Multi-Objective GFlowNets

In this section, we introduce Multi-Objective GFlowNets (MOGFNs) to tackle the problem of diverse Pareto-optimal candidate generation. The two MOGFN variants we discuss respectively exploit the decomposition of MOO problems into a family of independent single-objective sub-problems or a sequence of single-objective sub-problems.

3.1. Preference-Conditional GFlowNets

As discussed in Section 2.1, given an appropriate scalarization function, candidates optimizing each sub-problem $\max_{x \in \mathcal{X}} R(x|\omega)$ correspond to a single point on the Pareto front. As the objectives are often imperfect proxies for some underlying property of interest, we aim to generate diverse candidates for each sub-problem. One naive way to achieve this is by solving each independent sub-problem with a separate GFlowNet. However, this approach not only poses significant computational challenges in terms of training a large number of GFlowNets, but also fails to take advantage of the shared structure present between the sub-problems.

Reward-conditional GFlowNets (Bengio et al., 2021b) are a generalization of GFlowNets that learn a single conditional policy to simultaneously model a family of distributions associated with a corresponding family of reward functions. Let $\mathcal{C}$ denote a set of values $c$, with each $c \in \mathcal{C}$ inducing a unique reward function $R(x|c)$. We can define a family of weighted DAGs $\{\mathcal{G}_c = (\mathcal{S}_c, \mathcal{E}),\ c \in \mathcal{C}\}$ which describe the construction of $x \in \mathcal{X}$, with conditioning information $c$ available at all states in $\mathcal{S}_c$. Having $c$ as input allows the policy to learn the shared structure across different values of $c$. We denote $P_F(\cdot|s, c)$ and $P_B(\cdot|s', c)$ as the conditional forward and backward policies, $Z(c) = \sum_{x \in \mathcal{X}} R(x|c)$ as the conditional partition function, and $\pi(x|c)$ as the marginal likelihood (given $c$) of sampling trajectories $\tau$ from $P_F$ terminating in $x$. The learning objective in reward-conditional GFlowNets is thus estimating $P_F(\cdot|s, c)$ such that $\pi(x|c) \propto R(x|c)$. Exploiting the shared structure enables a single conditional policy (e.g., a neural net taking $c$ and $s$ as input and outputting action probabilities) to model the entire family of reward functions simultaneously. Moreover, the policy can generalize to values of $c$ not seen during training.

The MOO sub-problems possess a similar shared structure, induced by the preferences. Thus, we can leverage a single reward-conditional GFlowNet instead of a set of independent GFlowNets to model the sub-problems simultaneously. Formally, Preference-Conditional GFlowNets (MOGFN-PC) are reward-conditional GFlowNets with the preferences $\omega \in \Delta^d$ over the set of objectives $\{R_1(x), \dots, R_d(x)\}$ as the conditioning variable. MOGFN-PC models the family of reward functions defined by the scalarization function $R(x|\omega)$. MOGFN-PC is a general approach and can accommodate any scalarization function, be it one of the existing ones discussed in Section 2.1 or a novel scalarization function designed for a particular task. To illustrate this flexibility, we introduce the Weighted-log-sum (WL) scalarization function, $R(x|\omega) = \prod_{i=1}^{d} R_i(x)^{\omega_i}$. We hypothesize that the weighted sum in log space can potentially help in scenarios where all objectives must be optimized simultaneously and the scalar reward from Weighted Sum would otherwise be dominated by a single objective. The scalarization function is a key component of MOGFN-PC, and we further study the empirical impact of various scalarization functions in Section 6.

Training MOGFN-PC: The procedure to train MOGFN-PC, or any reward-conditional GFlowNet, closely follows that of a standard GFlowNet and is described in Algorithm 1. The objective is to learn the parameters $\theta$ of the forward and backward conditional policies $P_F(\cdot|s, \omega; \theta)$ and $P_B(\cdot|s', \omega; \theta)$, and the log-partition function $\log Z_\theta(\omega)$, such that $\pi(x|\omega) \propto R(x|\omega)$. To this end, we consider an extension of the trajectory balance objective:

$$\mathcal{L}(\tau, \omega; \theta) = \left( \log \frac{Z_\theta(\omega) \prod_{s \rightarrow s' \in \tau} P_F(s' \mid s, \omega; \theta)}{R(x \mid \omega) \prod_{s \rightarrow s' \in \tau} P_B(s \mid s', \omega; \theta)} \right)^2 \qquad (2)$$

One important component is the distribution $p(\omega)$ used to sample preferences during training. $p(\omega)$ influences the regions of the Pareto front that are captured by MOGFN-PC. In our experiments, we use a Dirichlet($\alpha$) distribution to sample preferences $\omega$, which are encoded with thermometer encoding (Buckman et al., 2018) when input to the policy in some of the tasks. Following prior work, we apply an exponent $\beta$ to the reward $R(x|\omega)$, i.e., $\pi(x|\omega) \propto R(x|\omega)^\beta$. This incentivizes the policy to focus on the modes of $R(x|\omega)$, which is critical for the generation of high-reward diverse candidates. By changing $\beta$ we can trade off diversity for higher rewards. We study the impact of these choices empirically in Section 6.
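A condensed sketch (ours, in PyTorch) of the preference-conditional trajectory balance loss in Equation (2) for one sampled trajectory; log_pf and log_pb are sums of conditional log-probabilities along the trajectory, and log_Z_w is the predicted log Z_theta(omega):

```python
import torch

def preference_conditional_tb_loss(log_Z_w, log_pf, log_pb, log_reward, beta=1.0):
    """Squared log-ratio of Eq. (2), with the reward exponent beta applied to R(x|omega)."""
    return (log_Z_w + log_pf - beta * log_reward - log_pb) ** 2

# Hypothetical usage inside Algorithm 1 (names are placeholders, not the paper's code):
# omega = torch.distributions.Dirichlet(torch.full((d,), alpha)).sample()
# ... roll out PF(.|s, omega) to obtain x, log_pf, log_pb ...
# loss = preference_conditional_tb_loss(model.log_Z(omega), log_pf, log_pb,
#                                       torch.log(scalarized_reward(x, omega)))
# loss.backward()
```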
MOGFN-PC and MOReinforce: MOGFN-PC is closely related to MOReinforce (Lin et al., 2021) in that both learn a preference-conditional policy to sample Pareto-optimal candidates. The key difference is the learning objective: MOReinforce uses a multi-objective variant of REINFORCE (Williams, 1992), whereas MOGFN-PC uses a preference-conditional GFlowNet objective (Equation (2)). MOReinforce, given a preference $\omega$, will converge to generating a single candidate that maximizes $R(x|\omega)$. MOGFN-PC, on the other hand, samples from $R(x|\omega)$, resulting in the generation of diverse candidates from the Pareto set according to the preferences specified by $\omega$. This ability to generate diverse Pareto-optimal candidates is a key feature of MOGFN-PC, whose advantage is demonstrated empirically in Section 5.

3.2. Multi-Objective Active Learning with GFlowNets

In many applications, the objective functions of interest are often computationally expensive to evaluate. Consider the drug discovery scenario, where evaluating objectives such as the binding energy of a candidate molecule to a target, even in imperfect simulations, can take several hours. Sample efficiency, in terms of the number of evaluations of the objective functions, is therefore critical. We introduce MOGFN-AL, which leverages GFlowNets to generate candidates in each round of a multi-objective Bayesian optimization loop. MOGFN-AL tackles MOO through a sequence of single-objective sub-problems defined by an acquisition function $a$. MOGFN-AL is a multi-objective extension of GFlowNet-AL (Jain et al., 2022). Here, we apply MOGFN-AL to biological sequence design, summarized in Algorithm 2 (Appendix A), building upon the framework proposed by Stanton et al. (2022). This problem was previously studied by Seff et al. (2019) and has connections to denoising autoencoders (Bengio et al., 2013).

In existing work applying GFlowNets to biological sequence design, the GFlowNet policy generates the sequences token-by-token (Jain et al., 2022). While this offers greater flexibility to explore the space of sequences, it can be prohibitively slow when the sequences are long. In contrast, we use GFlowNets to propose candidates at each round $i$ by generating mutations for existing candidates $x \in \hat{\mathcal{P}}_i$, where $\hat{\mathcal{P}}_i$ is the set of non-dominated candidates in $\mathcal{D}_i$. Given a sequence $x$, the GFlowNet, through a sequence of stochastic steps, generates a set of mutations $m = \{(l_i, v_i)\}_{i=1}^{T}$, where $l_i \in \{1, \dots, |x|\}$ is the location to be replaced, $v_i \in \mathcal{A}$ is the token that replaces $x[l_i]$, and $T$ is the number of mutations. Let $x'_m$ be the sequence resulting from applying the mutations $m$ to the sequence $x$. The reward for a set of sampled mutations for $x$ is the value of the acquisition function at $x'_m$: $R(m, x) = a(x'_m \mid \hat{f})$. This mutation-based approach scales better to tasks with longer sequences while still affording ample exploration in sequence space for generating diverse candidates. We demonstrate this empirically in Section 5.3.
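A small sketch (ours) of the mutation-based proposal: the GFlowNet samples a set of (location, token) substitutions whose reward is the acquisition value of the mutated sequence. The acquisition and surrogate arguments are assumed interfaces, and random_mutations is a hypothetical stand-in for the trained policy:

```python
import random

AMINO_ACIDS = list("ARNDCEQGHILKMFPSTWYV")  # token alphabet used in the sequence tasks

def apply_mutations(x, mutations):
    """Apply substitutions m = {(l_i, v_i)} to sequence x and return x'_m."""
    chars = list(x)
    for loc, token in mutations:
        chars[loc] = token
    return "".join(chars)

def mutation_reward(x, mutations, acquisition, surrogate):
    """R(m, x) = a(x'_m | f_hat): the acquisition value of the mutated sequence."""
    return acquisition(apply_mutations(x, mutations), surrogate)

def random_mutations(x, n_mutations=3):
    """Hypothetical stand-in for the GFlowNet policy over mutation sets."""
    return [(random.randrange(len(x)), random.choice(AMINO_ACIDS))
            for _ in range(n_mutations)]
```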
4. Related Work

Evolutionary Algorithms (EA): Traditionally, evolutionary algorithms such as NSGA-II have been widely used in various multi-objective optimization problems (Ehrgott, 2005; Konak et al., 2006; Blank & Deb, 2020). More recently, Miret et al. (2022) incorporated graph neural networks into evolutionary algorithms, enabling them to tackle large combinatorial spaces. Unlike MOGFNs, evolutionary algorithms are required to solve each instance of an MOO problem from scratch, rather than amortizing computation during training in order to quickly generate solutions at run-time. Evolutionary algorithms, however, can be augmented with MOGFNs for generating mutations to improve efficiency, as in Section 3.2.

Multi-Objective Reinforcement Learning: MOO problems have also received significant interest in the RL literature (Hayes et al., 2022). Traditional approaches broadly consist of learning sets of Pareto-dominant policies (Roijers et al., 2013; Van Moffaert & Nowé, 2014; Reymond et al., 2022). Recent work has focused on extending deep RL algorithms to multi-objective settings, e.g., with Envelope-MOQ (Yang et al., 2019), MO-MPO (Abdolmaleki et al., 2020; 2021), and MOReinforce (Lin et al., 2021). A general shortcoming of RL-based approaches is that their objective focuses on discovering a single mode of the reward function, so they hardly generate diverse candidates, an issue that also persists in the multi-objective setting. In contrast, MOGFNs sample candidates proportionally to the reward, implicitly resulting in diverse candidates.

Multi-Objective Bayesian Optimization (MOBO): Bayesian optimization (BO) has been used in the context of MOO when the objectives are expensive to evaluate and sample efficiency is a key consideration. MOBO approaches consist of learning a surrogate model of the true objective functions, which is used to define an acquisition function such as expected hypervolume improvement (Emmerich et al., 2011; Daulton et al., 2020; 2021) or max-value entropy search (Belakaria et al., 2019), as well as scalarization-based approaches (Paria et al., 2020; Zhang & Golovin, 2020). Abdolshah et al. (2019) and Lin et al. (2022) study the MOBO problem in the setting with preferences over the different objectives. Stanton et al. (2022) proposed LaMBO, which uses language models in conjunction with BO for multi-objective sequence design problems. While recent work (Konakovic Lukovic et al., 2020; Maus et al., 2022) studies the problem of generating diverse candidates in the context of MOBO, it is limited to local optimization near Pareto-optimal candidates in low-dimensional continuous problems. As such, the key drawbacks of MOBO approaches are that they typically do not consider the need for diversity in generated candidates and that they mainly consider continuous low-dimensional state spaces. As we discuss in Section 3.2, MOBO approaches can be augmented with GFlowNets for diverse candidate generation in discrete spaces.

Other Approaches: Zhao et al. (2022) introduced LaMOO, which tackles the MOO problem by iteratively splitting the candidate space into smaller regions, whereas Daulton et al. (2022) introduced MORBO, which performs BO in parallel on multiple local regions of the candidate space. Both of these methods, however, are limited to continuous candidate spaces.

5. Empirical Results

In this section, we present our empirical findings across a wide range of tasks ranging from sequence design to molecule generation. Through our experiments, we aim to answer the following questions:

Q1 Can MOGFNs model the preference-conditional reward distribution?

Q2 Can MOGFNs sample Pareto-optimal candidates?

Q3 Are candidates sampled by MOGFNs diverse?

Q4 Do MOGFNs scale to high-dimensional problems relevant in practice?

We obtain positive experimental evidence for Q1-Q4.
Metrics: We rely on standard MOO metrics such as the Hypervolume (HV) and R2 indicators, as well as the Generational Distance+ (GD+). To measure diversity we use the Top-K Diversity and Top-K Reward metrics of Bengio et al. (2021a). We detail all metrics in Appendix C. For all our empirical evaluations we follow the same protocol. First, we sample a set of preferences which are fixed for all the methods. For each preference we sample 128 candidates, from which we pick the top 10, compute their scalarized reward and diversity, and report the averages over preferences. We then use these samples to compute the HV and R2 indicators. We pick the best hyperparameters for all methods based on the HV and report the mean and standard deviation over 3 seeds for all quantities.

Baselines: We consider the closely related MOReinforce (Lin et al., 2021) as a baseline. We also study its variants MOSoftQL and MOA2C, which use Soft Q-Learning (Haarnoja et al., 2017) and A2C (Mnih et al., 2016) in place of REINFORCE. We additionally compare against Envelope-MOQ (Yang et al., 2019), another popular multi-objective reinforcement learning method. For fragment-based molecule generation we consider an additional baseline, MARS (Xie et al., 2021), a relevant MCMC approach for this task. Notably, we do not consider baselines like LaMOO (Zhao et al., 2022) and MORBO (Daulton et al., 2022), as they are designed for continuous spaces and rely on latent representations from pre-trained models for discrete tasks like molecule generation, making the comparisons unfair as the rest of the methods are trained on significantly fewer molecules. Additionally, it is not clear how to apply them to tasks like DNA aptamer design, where pre-trained models are not available.

Figure 1. (a) The distribution learned by MOGFN-PC (top) almost exactly matches the ground-truth distribution (bottom), in particular capturing all the modes, on a hypergrid of size 32x32 with 3 objectives. (b) and (c) The Pareto fronts of candidates generated by MOGFN-PC on the N-grams task (panels: "3 Unigrams (Conflicting) Pareto Front" and "3 Bigrams (Correlated) Pareto Front") illustrate that MOGFN-PC can model conflicting and correlated objectives well.

5.1. Synthetic Tasks

5.1.1. HYPER-GRID

We first study the ability of MOGFN-PC to capture the preference-conditional reward distribution in a multi-objective version of the HyperGrid task from Bengio et al. (2021a). The goal here is to sample points within a HyperGrid with probability proportional to a reward. We consider the following objectives for our experiments: branin(x), currin(x) and shubert(x); we present additional results with more objectives in Appendix D.1. Since the state space is small, we can compute the distribution learned by MOGFN-PC in closed form. In Figure 1a, we visualize $\pi(x|\omega)$, the distribution learned by MOGFN-PC conditioned on a set of fixed preference vectors $\omega$, and contrast it with the true distribution $R(x|\omega)$ in a 32x32 hypergrid with 3 objectives. We observe that $\pi(\cdot|\omega)$ and $R(\cdot|\omega)$ are very similar. To quantify this, we compute $\mathbb{E}_x\left[\,\lvert \pi(x|\omega) - R(x|\omega)/Z(\omega) \rvert\,\right]$ averaged over a set of 64 preferences, and find a difference of about $10^{-4}$. Note that MOGFN-PC is able to capture all the modes in the distribution, which suggests the candidates sampled from $\pi$ would be diverse. Further, we compute the GD+ metric for the Pareto front of candidates generated with MOGFN-PC, which comes to an average value of 0.42. For more details about the task and additional results, refer to Appendix D.1.
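For reference, a sketch (ours) of the per-preference evaluation protocol described above: keep the top k of N sampled candidates by scalarized reward, and report their mean reward and mean pairwise distance. The pymoo hypervolume call at the end is an assumption about that library's API and may differ across versions:

```python
import numpy as np

def top_k_metrics(candidates, scalar_rewards, distance, k=10):
    """Mean reward and mean pairwise distance (diversity) of the top-k candidates."""
    idx = np.argsort(scalar_rewards)[::-1][:k]
    top = [candidates[i] for i in idx]
    mean_reward = float(np.mean([scalar_rewards[i] for i in idx]))
    dists = [distance(a, b) for i, a in enumerate(top) for b in top[i + 1:]]
    return mean_reward, float(np.mean(dists))

# Pareto metrics over the pooled samples (objective matrix F of shape (n, d),
# all objectives maximized and scaled to [0, 1]):
# from pymoo.indicators.hv import HV                     # assumed API
# hypervolume = HV(ref_point=np.zeros(F.shape[1]))(-F)   # negate: pymoo assumes minimization
```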
5.1.2. N-GRAMS TASK

We consider a version of the synthetic sequence design task from Stanton et al. (2022). The task consists of generating strings with the objectives given by occurrences of a set of $d$ n-grams. MOGFN-PC adequately models the trade-off between conflicting objectives in the 3 Unigrams task, as illustrated by the Pareto front of generated candidates in Figure 1b. For the 3 Bigrams task with correlated objectives (as there are common characters in the bigrams), Figure 1c demonstrates that MOGFN-PC generates candidates which can simultaneously maximize multiple objectives. We refer the reader to Appendix D.2 for more task details and additional results with different numbers of objectives and varying sequence lengths. In the results summarized in Table 7 in the Appendix, MOGFN-PC outperforms the baselines in terms of the MOO metrics while generating diverse candidates. Note that since the occurrences of n-grams constitute the reward, diversity is limited by performance, i.e., high-scoring sequences will have lower diversity, which explains the higher diversity of MOSoftQL. We also observe that the MOReinforce and Envelope-MOQ baselines struggle in this task, potentially due to longer trajectories with sparse rewards.

5.2. Benchmark Tasks

5.2.1. QM9

We first consider a small-molecule generation task based on the QM9 dataset (Ramakrishnan et al., 2014). We generate molecules atom-by-atom and bond-by-bond with up to 9 atoms and use 4 reward signals. The main reward is obtained via an MXMNet (Zhang et al., 2020) proxy trained on QM9 to predict the HOMO-LUMO gap. The other rewards are Synthetic Accessibility (SA), a molecular weight target, and a molecular logP target. Rewards are normalized to be between 0 and 1, but the gap proxy can exceed 1, and so is clipped at 2. The preferences $\omega$ are input to the policy as is, without thermometer encoding, as we find no significant difference between the two choices. We train the models with 1M molecules and present the results in Table 1, showing that MOGFN-PC outperforms all baselines in terms of Pareto performance and diverse candidate generation.

Table 1. Atom-based QM9 task: MOGFN-PC exceeds the baselines in diversity and Pareto performance on the QM9 task with HOMO-LUMO gap, SA, QED and molecular weight objectives.

| Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓) |
|---|---|---|---|---|
| Envelope QL | 0.65 ± 0.06 | 0.85 ± 0.01 | 1.26 ± 0.05 | 5.80 ± 0.20 |
| MOReinforce | 0.57 ± 0.12 | 0.53 ± 0.08 | 1.35 ± 0.01 | 4.65 ± 0.03 |
| MOA2C | 0.61 ± 0.05 | 0.39 ± 0.28 | 1.16 ± 0.08 | 6.28 ± 0.67 |
| MOGFN-PC | 0.76 ± 0.00 | 0.93 ± 0.00 | 1.40 ± 0.18 | 2.44 ± 1.88 |

Table 2. Fragment-based Molecule Generation Task: Diversity and Pareto performance on the fragment-based drug design task with sEH, QED, SA and molecular weight objectives.

| Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓) |
|---|---|---|---|---|
| Envelope QL | 0.70 ± 0.10 | 0.15 ± 0.05 | 0.74 ± 0.01 | 3.51 ± 0.10 |
| MARS | — | — | 0.85 ± 0.008 | 1.94 ± 0.03 |
| MOReinforce | 0.41 ± 0.07 | 0.01 ± 0.007 | 0 ± 0 | 9.88 ± 1.06 |
| MOA2C | 0.76 ± 0.16 | 0.48 ± 0.39 | 0.75 ± 0.01 | 3.35 ± 0.02 |
| MOGFN-PC | 0.89 ± 0.05 | 0.75 ± 0.01 | 0.90 ± 0.01 | 1.86 ± 0.08 |

5.2.2. FRAGMENT-BASED MOLECULE GENERATION

We evaluate our method on the fragment-based (Kumar et al., 2012) molecular generation task of Bengio et al. (2021a), where the task is to generate molecules by linking fragments to form a junction tree (Jin et al., 2020). The main reward function is obtained via a pretrained proxy, available from Bengio et al. (2021a), trained on molecules docked with AutoDock Vina (Trott & Olson, 2010) for the sEH target.
The other rewards are based on Synthetic Accessibility (SA), drug-likeness (QED), and a molecular weight target. We detail the reward construction in Appendix D.4. The preferences $\omega$ are input to the policy directly for this task as well. Similarly to QM9, we train MOGFN-PC to generate 1M molecules and report the results in Table 2. We observe that MOGFN-PC consistently outperforms the baselines not only in terms of HV and R2, but also in candidate diversity score. Note that we do not report reward and diversity scores for MARS, since the lack of preference conditioning would make it an unfair comparison.

5.2.3. DNA SEQUENCE GENERATION

A practical example of a design problem where the GFlowNet graph is a tree is the generation of DNA aptamers, single-stranded nucleotide sequences that are popular in biological polymer design due to their specificity and affinity as sensors in crowded biochemical environments (Zhou et al., 2017; Corey et al., 2022; Yesselman et al., 2019; Kilgour et al., 2021). We generate sequences by adding one nucleobase (A, C, T or G) at a time, with a maximum length of 60 bases. We consider three objectives: the free energy of the secondary structure calculated with the software NUPACK (Zadeh et al., 2011), the number of base pairs, and the inverse of the sequence length (to favour shorter sequences). The results summarized in Table 3 indicate that MOReinforce achieves the best MOO performance. However, we note that it finds a quasi-trivial solution with the pattern GCGCGC... for most lengths, yielding low diversity. In contrast, MOGFN-PC obtains much better diversity and Top-K rewards with slightly lower Pareto performance. See Appendix D.5 for further discussion.

Table 3. DNA Sequence Design Task: Diversity and Pareto performance of various algorithms on the DNA sequence generation task with free energy, number of base pairs and inverse sequence length objectives.

| Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓) |
|---|---|---|---|---|
| Envelope-MOQ | 0.238 ± 0.042 | 0.0 ± 0.0 | 0.163 ± 0.013 | 5.657 ± 0.673 |
| MOReinforce | 0.105 ± 0.002 | 0.6178 ± 0.209 | 0.629 ± 0.002 | 1.925 ± 0.003 |
| MOSoftQL | 0.446 ± 0.010 | 32.130 ± 0.542 | 0.163 ± 0.014 | 5.565 ± 0.170 |
| MOGFN-PC | 0.682 ± 0.021 | 18.131 ± 0.981 | 0.517 ± 0.006 | 2.432 ± 0.002 |

5.3. Active Learning

Finally, to evaluate MOGFN-AL, we consider the Proxy RFP task from Stanton et al. (2022), with the aim of discovering novel proteins with red fluorescence properties, optimizing for folding stability and solvent-accessible surface area. We adopt all the experimental details (described in Appendix D.6) from Stanton et al. (2022), using MOGFN-AL for candidate generation. In addition to LaMBO, we use a model-free (NSGA-2) and a model-based EA from Stanton et al. (2022) as baselines. We observe in Figure 2a that MOGFN-AL results in significant gains in the improvement of the hypervolume relative to the initial dataset, within a given budget of black-box evaluations. In fact, MOGFN-AL is able to match the performance of LaMBO within about half the number of black-box evaluations. Figure 2b illustrates the Pareto frontier of candidates generated with MOGFN-AL, which dominates the Pareto frontier of the initial dataset. As the candidates are generated by mutating sequences in the existing Pareto front, we also highlight the sequences that are mutations of each sequence in the initial dataset with the same color. To quantify the diversity of the generated candidates we measure the average e-value from DIAMOND (Buchfink et al., 2021) between the initial Pareto front and the Pareto frontier of generated candidates. Figure 2c shows that MOGFN-AL generates candidates that are more diverse than the baselines.

Figure 2. MOGFN-AL demonstrates a substantial advantage in terms of (a) relative hypervolume; (b) the Pareto frontier of candidates generated by MOGFN-AL dominates the Pareto front of the initial dataset; and (c) the generated candidates are more diverse than those of the baselines. (Panel (a): relative hypervolume vs. black-box evaluations for MOGFN-AL, LaMBO, MTGP + NEHVI + GA and NSGA-2 (model-free). Panel (b): old and new Pareto frontiers, with SASA on one axis and points labelled by RFP variants such as DsRed.M1, mScarlet, DsRed.T4, AdRed, mRouge and RFP630. Panel (c): diversity table below.)

| Algorithm | Diversity (↑) |
|---|---|
| NSGA-2 | 0.07 ± 0.03 |
| MTGP + NEHVI + GA | 0.12 ± 0.02 |
| LaMBO | 0.18 ± 0.03 |
| MOGFN-AL | 0.25 ± 0.01 |
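For completeness, a condensed sketch (ours) of one MOGFN-AL round as used above (Algorithm 2 in Appendix A); surrogate, acquisition, train_gflownet and oracles are assumed interfaces, not the paper's code:

```python
import random

def is_dominated(y, others):
    """y is dominated if some other objective vector is >= everywhere and > somewhere."""
    return any(all(o >= yi for o, yi in zip(other, y)) and
               any(o > yi for o, yi in zip(other, y)) for other in others)

def mogfn_al_round(dataset, surrogate, acquisition, train_gflownet, oracles, b=16):
    """One round: fit f_hat, train the mutation GFlowNet with the acquisition value
    as reward, propose b mutants of the current non-dominated set, and evaluate them."""
    surrogate.fit(dataset)
    pareto = [x for i, (x, y) in enumerate(dataset)
              if not is_dominated(y, [yj for j, (_, yj) in enumerate(dataset) if j != i])]
    policy = train_gflownet(pareto, surrogate, acquisition)     # reward = a(x'_m | f_hat)
    batch = [policy.propose(random.choice(pareto)) for _ in range(b)]
    return dataset + [(x, oracles(x)) for x in batch]           # D_i = D_{i-1} U B_i
```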
6. Analysis

In this section, we isolate the important components of MOGFN-PC, namely the distribution $p(\omega)$ for sampling preferences during training, the reward exponent $\beta$, and the scalarization function $R(x|\omega)$, to understand their impact on performance. Figure 3 summarizes the results of the analysis on the 3 Bigrams task (Section 5.1.2) and the fragment-based molecule generation task (Section 5.2.2), with additional results in Appendix B.

Figure 3. Analysing the effect of (a) $p(\omega)$, (b) $R(x|\omega)$ and (c) $\beta$ in the 3 Bigrams (left) and fragment-based molecule generation (right) tasks, in terms of Reward, Diversity and R2, with the metrics normalized by the maximum value obtained. As key takeaways, Dirichlet($\alpha$ = 1) for $p(\omega)$ and Weighted Sum (WS) scalarization have the best performance in both tasks, while the choice of $\beta$ is task-dependent.

Impact of $p(\omega)$. To examine the effect of $p(\omega)$, which controls the coverage of the Pareto front, we set it to Dirichlet($\alpha$) and vary $\alpha \in \{0.1, 1, 10\}$. This results in $\omega$ being sampled from different regions of $\Delta^d$. Specifically, $\alpha = 1$ corresponds to a uniform distribution over $\Delta^d$, $\alpha > 1$ is skewed towards the center of $\Delta^d$, whereas $\alpha < 1$ is skewed towards the corners of $\Delta^d$. In Figure 3a we observe that $\alpha = 1$ results in the best performance. Despite the skewed distributions with $\alpha = 0.1$ and $\alpha = 10$, we still achieve performance close to that of $\alpha = 1$, indicating that MOGFN-PC is able to interpolate to preferences not sampled during training.

Choice of scalarization $R(x|\omega)$. The set of $R(x|\omega)$ for different $\omega$ specifies the family of MOO sub-problems and thus has a critical impact on the Pareto performance. Figure 3b illustrates results for the Weighted Sum (WS), Weighted-log-sum (WL) and Weighted Tchebycheff (WT) scalarizations. Note that we do not compare the Top-K Reward, as different scalarizations cannot be compared directly. WS scalarization results in the best performance. We suspect the poor performance of WT and WL is in part also due to the less smooth reward landscapes they induce.

Impact of $\beta$. During training, $\beta$ controls the concentration of the reward density around modes of the distribution. For large values of $\beta$, the reward density around the modes becomes more peaked, and vice versa. In Figure 3c we present the results obtained by varying $\beta \in \{16, 32, 48, 96\}$. As $\beta$ increases, MOGFN-PC is incentivized to generate samples closer to the modes of $R(x|\omega)$, resulting in better Pareto performance. However, with high $\beta$ values, the reward density is concentrated close to the modes and there is a negative impact on the diversity of the candidates.
High values of β also make the optimization harder, resulting in poorer performance. As such β is a task specific parameter which can be tuned to identify the best trade-off between Multi-Objective GFlow Nets Pareto performance and diversity but also affects training efficiency. 7. Conclusion In this work, we study Multi-Objective GFlow Nets (MOGFN) for the generation of diverse Pareto-optimal candidates. We present two instantiations of MOGFNs: MOGFN-PC, which models a family of independent single-objective sub-problems with conditional GFlow Nets, and MOGFN-AL, which optimizes a sequence of singleobjective sub-problems. We demonstrate the efficacy of MOGFNs empirically on wide range of tasks ranging from molecule generation to protein design. As a limitation, we note that the benefits of MOGFNs are limited in problems where the rewards have a single mode (see Section 5.2.3). For certain practical applications, we are interested only in a specific region of the Pareto front. Future work may explore gradient-based techniques to learn p(ω) for more structured exploration of the preference space. Within the context of MOFGN-AL, an interesting research avenue is the development of preference-conditional acquisition functions. Code The code for the experiments is available at https://github.com/GFNOrg/ multi-objective-gfn. Acknowledgements We would like to thank Nikolay Malkin, Yiding Jiang and Julien Roy for valuable feedback and comments. The research was enabled in part by computational resources provided by the Digital Research Alliance of Canada (https://alliancecan.ca/en) and Mila (https: //mila.quebec). We also acknowledge funding from CIFAR, IVADO, Intel, Samsung, IBM, Genentech, Microsoft. Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., and Riedmiller, M. A distributional view on multiobjective policy optimization. In International Conference on Machine Learning, pp. 11 22. PMLR, 2020. Abdolmaleki, A., Huang, S. H., Vezzani, G., Shahriari, B., Springenberg, J. T., Mishra, S., TB, D., Byravan, A., Bousmalis, K., Gyorgy, A., et al. On multi-objective policy optimization as a tool for reinforcement learning. ar Xiv preprint ar Xiv:2106.08199, 2021. Abdolshah, M., Shilton, A., Rana, S., Gupta, S., and Venkatesh, S. Multi-objective bayesian optimisation with preferences over objectives. Advances in neural information processing systems, 32, 2019. Belakaria, S., Deshwal, A., and Doppa, J. R. Max-value entropy search for multi-objective bayesian optimization. Advances in Neural Information Processing Systems, 32, 2019. Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for noniterative diverse candidate generation. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/forum? id=Arn2E4IRj EB. Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising auto-encoders as generative models. Advances in neural information processing systems, 26, 2013. Bengio, Y., Deleu, T., Hu, E. J., Lahlou, S., Tiwari, M., and Bengio, E. Gflownet foundations, 2021b. Blank, J. and Deb, K. pymoo: Multi-objective optimization in python. IEEE Access, 8:89497 89509, 2020. Buchfink, B., Reuter, K., and Drost, H.-G. Sensitive protein alignments at tree-of-life scale using diamond. Nature methods, 18(4):366 368, 2021. Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. 
Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018. Choo, E. U. and Atkins, D. R. Proper efficiency in nonconvex multicriteria programming. Mathematics of Operations Research, 8(3):467 470, 1983. Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422 1423, 2009. Corey, D. R., Damha, M. J., and Manoharan, M. Challenges and opportunities for nucleic acid therapeutics. nucleic acid therapeutics, 32(1):8 13, 2022. Dance, A. et al. The hunt for red fluorescent proteins. Nature, 596(7870):152 153, 2021. Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C., and Ahsan, M. J. Machine learning in drug discovery: a review. Artificial Intelligence Review, pp. 1 53, 2021. Multi-Objective GFlow Nets Daulton, S., Balandat, M., and Bakshy, E. Differentiable expected hypervolume improvement for parallel multiobjective bayesian optimization. Advances in Neural Information Processing Systems, 33:9851 9864, 2020. Daulton, S., Balandat, M., and Bakshy, E. Parallel bayesian optimization of multiple noisy objectives with expected hypervolume improvement. Advances in Neural Information Processing Systems, 34:2187 2200, 2021. Daulton, S., Eriksson, D., Balandat, M., and Bakshy, E. Multi-objective bayesian optimization over highdimensional search spaces. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. URL https: //openreview.net/forum?id=r5IEvv Is9xq. Ehrgott, M. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005. Emmerich, M. T., Deutz, A. H., and Klinkenberg, J. W. Hypervolume-based expected improvement: Monotonicity properties and exact computation. In 2011 IEEE Congress of Evolutionary Computation (CEC), pp. 2147 2154. IEEE, 2011. Fonseca, C., Paquete, L., and Lopez-Ibanez, M. An improved dimension-sweep algorithm for the hypervolume indicator. In 2006 IEEE International Conference on Evolutionary Computation, pp. 1157 1163, 2006. doi: 10.1109/CEC.2006.1688440. Garnett, R. Bayesian Optimization. Cambridge University Press, 2022. in preparation. Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp. 1352 1361. PMLR, 2017. Hansen, M. P. and Jaszkiewicz, A. Evaluating the quality of approximations to the non-dominated set. Citeseer, 1994. Hayes, C. F., R adulescu, R., Bargiacchi, E., K allstr om, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L. M., Dazeley, R., Heintz, F., et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):1 59, 2022. Ishibuchi, H., Masuda, H., Tanigaki, Y., and Nojima, Y. Modified distance calculation in generational distance and inverted generational distance. In Gaspar-Cunha, A., Henggeler Antunes, C., and Coello, C. C. (eds.), Evolutionary Multi-Criterion Optimization, pp. 110 125, Cham, 2015. Springer International Publishing. ISBN 978-3-319-15892-1. Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., et al. Biological sequence design with gflownets. In International Conference on Machine Learning, pp. 9786 9801. PMLR, 2022. Jin, W., Barzilay, R., and Jaakkola, T. Chapter 11. 
junction tree variational autoencoder for molecular graph generation. Drug Discovery, pp. 228 249, 2020. ISSN 20413211. Keeney, R., Raiffa, H., L, K., and Meyer, R. Decisions with Multiple Objectives: Preferences and Value Trade-Offs. Wiley series in probability and mathematical statistics. Applied probability and statistics. Cambridge University Press, 1993. ISBN 9780521438834. URL https:// books.google.ca/books?id=GPE6ZAq Grno C. Kilgour, M., Liu, T., Walker, B. D., Ren, P., and Simine, L. E2edna: Simulation protocol for dna aptamers with ligands. Journal of Chemical Information and Modeling, 61(9):4139 4144, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http: //arxiv.org/abs/1412.6980. Konak, A., Coit, D. W., and Smith, A. E. Multi-objective optimization using genetic algorithms: A tutorial. Reliability Engineering and System Safety, 91(9):992 1007, 2006. ISSN 09518320. doi: 10.1016/j.ress.2005.11.018. Konakovic Lukovic, M., Tian, Y., and Matusik, W. Diversity-guided multi-objective bayesian optimization with batch evaluations. Advances in Neural Information Processing Systems, 33:17708 17720, 2020. Kumar, A., Voet, A., and Zhang, K. Y. Fragment based drug design: from experimental to computational approaches. Current medicinal chemistry, 19(30):5128 5147, 2012. Landrum, G. Rdkit: Open-source cheminformatics. URL http://www.rdkit.org. Lin, X., Yang, Z., and Zhang, Q. Pareto set learning for neural multi-objective combinatorial optimization. In International Conference on Learning Representations, 2021. Lin, Z. J., Astudillo, R., Frazier, P., and Bakshy, E. Preference exploration for efficient bayesian optimization with multiple outcomes. In International Conference on Artificial Intelligence and Statistics, pp. 4235 4258. PMLR, 2022. Multi-Objective GFlow Nets Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in gflownets. Neural Information Processing Systems (Neur IPS), 2022. Maus, N., Wu, K., Eriksson, D., and Gardner, J. Discovering many diverse solutions with bayesian optimization. ar Xiv preprint ar Xiv:2210.10953, 2022. Miettinen, K. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 2012. Miret, S., Chua, V. S., Marder, M., Phiellip, M., Jain, N., and Majumdar, S. Neuroevolution-enhanced multi-objective optimization for mixed-precision quantization. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1057 1065, 2022. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Pardalos, P. M., ˇZilinskas, A., ˇZilinskas, J., et al. Nonconvex multi-objective optimization. Springer, 2017. Paria, B., Kandasamy, K., and P oczos, B. A flexible framework for multi-objective bayesian optimization using random scalarizations. In Uncertainty in Artificial Intelligence, pp. 766 776. PMLR, 2020. Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific data, 1(1):1 7, 2014. Reymond, M., Bargiacchi, E., and Nowe, A. Pareto conditioned networks. 
In 21st International Conference on Autonomous Agents and Multi-agent System. IFAAMAS, 2022. Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67 113, 2013. Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F., and Serrano, L. The foldx web server: an online force field. Nucleic acids research, 33(suppl 2):W382 W388, 2005. Seff, A., Zhou, W., Damani, F., Doyle, A., and Adams, R. P. Discrete object generation with reversible inductive construction. Advances in neural information processing systems, 32, 2019. Shah, A. and Ghahramani, Z. Pareto frontier learning with expensive correlated objectives. In International conference on machine learning, pp. 1919 1927. PMLR, 2016. Shrake, A. and Rupley, J. A. Environment and exposure to solvent of protein atoms. lysozyme and insulin. Journal of molecular biology, 79(2):351 371, 1973. Stanton, S., Maddox, W., Gruver, N., Maffettone, P., Delaney, E., Greenside, P., and Wilson, A. G. Accelerating Bayesian optimization for biological sequence design with denoising autoencoders. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 20459 20478. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/ v162/stanton22a.html. Trott, O. and Olson, A. J. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455 461, 2010. Van Moffaert, K. and Now e, A. Multi-objective reinforcement learning using sets of pareto dominating policies. The Journal of Machine Learning Research, 15(1):3483 3512, 2014. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229 256, 1992. Xie, Y., Shi, C., Zhou, H., Yang, Y., Zhang, W., Yu, Y., and Li, L. {MARS}: Markov molecular sampling for multiobjective drug discovery. In International Conference on Learning Representations, 2021. URL https:// openreview.net/forum?id=k HSu4ebx FXY. Yang, R., Sun, X., and Narasimhan, K. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Advances in Neural Information Processing Systems, 32, 2019. Yesselman, J. D., Eiler, D., Carlson, E. D., Gotrik, M. R., d Aquino, A. E., Ooms, A. N., Kladwang, W., Carlson, P. D., Shi, X., Costantino, D. A., et al. Computational design of three-dimensional rna structure and function. Nature nanotechnology, 14(9):866 873, 2019. Yun, S., Jeong, M., Kim, R., Kang, J., and Kim, H. J. Graph transformer networks. Advances in neural information processing systems, 32, 2019. Zadeh, J. N., Steenberg, C. D., Bois, J. S., Wolfe, B. R., Pierce, M. B., Khan, A. R., Dirks, R. M., and Pierce, N. A. Nupack: Analysis and design of nucleic acid Multi-Objective GFlow Nets systems. Journal of computational chemistry, 32(1):170 173, 2011. Zhang, R. and Golovin, D. Random hypervolume scalarizations for provable multi-objective black box optimization. In International Conference on Machine Learning, pp. 11096 11105. PMLR, 2020. Zhang, S., Liu, Y., and Xie, L. 
Molecular mechanics-driven graph neural network with multiplex graph for molecular structures, 2020. URL https://arxiv.org/abs/ 2011.07457. Zhao, Y., Wang, L., Yang, K., Zhang, T., Guo, T., and Tian, Y. Multi-objective optimization by learning space partition. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=Flwz Vjf Mryn. Zhou, W., Saran, R., and Liu, J. Metal sensing by dna. Chemical reviews, 117(12):8272 8325, 2017. Multi-Objective GFlow Nets A. Algorithms We summarize the algorithms for MOGFN-PC and MOGFN-AL here. Algorithm 1 Training preference-conditional GFlow Nets Input: p(ω): Distribution for sampling preferences β: Reward Exponent δ: Mixing Coefficient for uniform actions in sampling policy N: Number of training steps Initialize: (PF (s |s, ω), PB(s|s , ω), log Z(ω)): Conditional GFlow Net with parameters θ for i = 1 to N do Sample preference ω p(ω) Sample trajectory τ following policy ˆπ = (1 δ)PF + δUniform Compute reward R(x|ω)β for generated samples and corresponding loss L(τ, ω; θ) as in Equation 2 Update parameters θ with gradients from the loss, θL(τ, ω) end for Algorithm 2 Training MOGFN-AL Input: R = {R1, . . . , Rd}: Oracles to evaluate candidates x and return true objectives (R1(x), . . . , Rd(x)) D0 = {(xi, yi)}: Initial dataset with yi = R(xi) ˆf: Probabilistic surrogate model to model posterior over R given a dataset D a(x| ˆf): Acquisition function computing a scalar utility for x given ˆf πθ: Learnable GFlow Net policy b: Size of candidate batch to be generated N: Number of active learning rounds Initialize: ˆf, πθ for i = 1 to M do Fit ˆf on dataset Di 1 Extract the set of non-dominated candidates ˆPi 1 from Di 1 Train πθ with to generate mutations for x ˆPi using a( | ˆf) as the reward Generate batch B = {x 1,mi, . . . , x b,mb} by sampling x i from ˆPi 1 and applying to it mutations mi sampled from πθ Evaluate batch B with R to generate ˆDi = {(x1, R(x1)), . . . , (xb, R(xb))} Update dataset Di = ˆDi Di 1 end for Result: Approximate Pareto set ˆPN B. Additional Analysis Can MOGFN-PC match Single Objective GFNs? To evaluate how well MOGFN-PC models the family of rewards R(x|ω), we consider a comparison with single objective GFlow Nets. More specifically, we first sample a set of 10 preferences ω1, . . . , ω10, and train a standard single objective GFlow Net using the weighted sum scalar reward for each preference. We then generate N = 128 candidates from each GFlow Net, throughout training, and compute the mean reward for the top 10 candidates for each preference. We average this top 10 reward across {ω1, . . . , ω10}, and call it Rso. We then train MOGFN-PC, and apply the sample procedure with the preferences {ω1, . . . , ω10}, and call the resulting mean of top 10 rewards Rmo. We plot the value of the ratio Rmo/Rso in Figure 4. We observe that the ratio stays close to 1, indicating that MOGFN-PC can indeed model the entire family of rewards simultaneously at least as fast as a single objective GFlow Net could. Multi-Objective GFlow Nets 0.0 2.5 5.0 7.5 10.0 Points seen (x1000) Average Reward Ratio (a) 3 Bigrams task 0.00 0.25 0.50 0.75 1.00 Points seen (x M) Average Reward Ratio (b) Fragment-based Molecule Generation Task Figure 4. We plot the ratio of rewards Rmo/Rso for candidates generated with MOGFN-PC (Rmo) and single-objective GFlow Nets(Rso) for a set of preferences in the (a) 3 Bigrams and (b) Fragment-based molecule generation tasks. 
We observe that MOGFN-PC matches and occasionally surpasses single-objective GFlowNets.

Effect of Model Capacity and Architecture: Finally, we look at the effect of model size when training MOGFN-PC. As MOGFN-PC models a conditional distribution, i.e., an entire family of functions as described before, we expect capacity to play a crucial role, since the amount of information to be learned is higher than for a single-objective GFlowNet. We increase model size in the 3 Bigrams task to see that effect, and observe in Table 4 that larger models do help with performance, although the performance plateaus after a point. We suspect that in order to fully utilize the model capacity we might need better training objectives.

Table 4. Analysing the impact of model size on the performance of MOGFN-PC. Each architecture choice for the policy is denoted as A-B-C, where A is the number of layers, B is the number of hidden units in each layer, and C is the number of attention heads.

| Metric | 3-64-8 | 3-256-8 | 4-64-8 | 4-256-8 |
|---|---|---|---|---|
| Reward (↑) | 0.44 ± 0.01 | 0.47 ± 0.00 | 0.49 ± 0.03 | 0.51 ± 0.01 |
| Diversity (↑) | 19.79 ± 0.08 | 17.13 ± 0.38 | 17.53 ± 0.15 | 16.12 ± 0.04 |
| Hypervolume (↑) | 0.22 ± 0.017 | 0.255 ± 0.008 | 0.262 ± 0.003 | 0.270 ± 0.011 |
| R2 (↓) | 9.97 ± 0.45 | 9.22 ± 0.25 | 8.95 ± 0.05 | 8.91 ± 0.12 |

C. Metrics

In this section we discuss the various metrics that we used to report the results in Section 5.

1. Generational Distance Plus (GD+) (Ishibuchi et al., 2015): This metric measures the Euclidean distance between the solutions of the Pareto approximation and the true Pareto front, taking the dominance relation into account. Calculating GD+ requires knowledge of the true Pareto front, and hence we only report this metric for the Hypergrid experiments (Section 5.1.1).

2. Hypervolume (HV) Indicator (Fonseca et al., 2006): This is a standard metric reported in Multi-Objective Optimization (MOO) works which measures the volume in the objective space, with respect to a reference point, spanned by a set of non-dominated solutions in the Pareto front approximation.

3. R2 Indicator (Hansen & Jaszkiewicz, 1994): R2 provides a monotonic metric comparing two Pareto front approximations using a set of uniform reference vectors and a utopian point $z^*$ representing the ideal solution of the MOO. Specifically, we define a set of uniform reference vectors $\lambda \in \Lambda$ that cover the space of the MOO and then calculate
$$R2(\Gamma, \Lambda, z^*) = \frac{1}{|\Lambda|} \sum_{\lambda \in \Lambda} \min_{\gamma \in \Gamma} \max_{i \in \{1, \dots, k\}} \left\{ \lambda_i \, |z^*_i - \gamma_i| \right\},$$
where $\gamma \in \Gamma$ corresponds to the set of solutions in a given Pareto front approximation and $z^*$ is the utopian point corresponding to the ideal solution of the MOO. Generally, R2 calculations are performed with $z^*$ equal to the origin and all objectives transformed to a minimization setting, which serves to preserve the monotonic nature of the metric. This holds true for our experiments as well.

4. Top-K Reward: This metric was originally used in Bengio et al. (2021a), which we extend to our multi-objective setting. For MOGFN-PC, we sample N candidates per test preference and then pick the top-k candidates (k < N) with the highest scalarized rewards and calculate their mean. We repeat this for all test preferences enumerated from the simplex and report the average top-k reward score.

5. Top-K Diversity: This metric was also originally used in Bengio et al. (2021a), which we again extend to our multi-objective setting. We use this metric to quantify the notion of diversity of the generated candidates. Given a distance metric d(x, y) between candidates x and y, we count as diverse those candidates with d(x, y) greater than a threshold ε. For MOGFN-PC, we sample N candidates per test preference, pick the top-k candidates based on the diversity scores, and take the mean. We repeat this for all test preferences sampled from the simplex and report the average top-k diversity score. We use the edit distance for sequences, and 1 minus the Tanimoto similarity for molecules.
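A direct transcription (ours) of the R2 computation above, with objectives already transformed to minimization and z* at the origin by default; weights is the set of uniform reference vectors Λ:

```python
import numpy as np

def r2_indicator(F, weights, z_star=None):
    """R2 = (1/|L|) * sum_w min_g max_i w_i * |z*_i - g_i| over the front F of shape (n, d)."""
    F = np.asarray(F, dtype=float)
    z = np.zeros(F.shape[1]) if z_star is None else np.asarray(z_star, dtype=float)
    values = [np.min(np.max(np.asarray(w) * np.abs(z - F), axis=1)) for w in weights]
    return float(np.mean(values))  # lower is better
```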
D. Additional Experimental Details

D.1. Hyper-Grid

Here we elaborate on the Hyper-Grid experimental setup discussed in Section 5.1.1. Consider an n-dimensional hypercube gridworld where each cell in the grid corresponds to a state. The agent starts at the top-left coordinate, marked (0, 0, . . .), and is allowed to move only towards the right, down, or to stop. When the agent performs the stop action, the trajectory terminates and the agent receives a non-zero reward. In this work, we consider the following reward functions: branin(x), currin(x), sphere(x), shubert(x), beale(x). In Figure 5, we show the heatmap for each reward function. Note that we normalize all the reward functions between 0 and 1.

Figure 5. Reward functions considered for the Hyper-Grid experiments presented in Section 5.1.1. Here the grid dimension is H = 32.

Additional Results To verify the efficacy of MOGFNs across different numbers of objectives, we perform additional experiments and measure the L1 loss and the GD+ metric. In Figure 6, we can see that as the reward dimension increases, the loss and GD+ increase. This is expected because the number of rewards is indicative of the difficulty of the problem. We also present extended qualitative visualizations across more preferences in Figure 7.

Figure 6. (Left) Average test loss between the MOGFN-PC distribution and the true distribution for an increasing number of objectives. (Right) GD+ metric of MOGFN-PC across objectives.

Model Details and Hyperparameters For MOGFN-PC policies we use an MLP with two hidden layers, each consisting of 64 units. We use LeakyReLU as our activation function, as in Bengio et al. (2021a). All models are trained with a learning rate of 0.01 using the Adam optimizer (Kingma & Ba, 2015) and a batch size of 128. We sample preferences ω from Dirichlet(α) with α = 1.5. We try two techniques for encoding the preferences: 1) vanilla encoding, where we just use the raw values of the preference vectors, and 2) thermometer encoding (Buckman et al., 2018). In our experiments we have not observed a significant difference in performance between the two.

Figure 7. Extended qualitative visualizations for the Hypergrid experiments.

D.2. N-grams Task

Task Details The task is to generate sequences of some maximum length L, which we set to 36 for the experiments in Section 5.1.2. We consider a vocabulary (action space) of size 21, with 20 characters ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"] and a special token to indicate the end of the sequence. The rewards {Ri}_{i=1}^d are defined by the number of occurrences of a given set of n-grams in a sequence x. For instance, consider ["AB", "BA"] as the n-grams: the rewards for the sequence x = ABABC would be [2, 1].
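A minimal sketch of this occurrence-count reward (plain Python; the normalization of rewards to [0, 1] used in the experiments is omitted):

```python
def ngram_rewards(x, ngrams=("AB", "BA")):
    # One objective per n-gram: the number of (possibly overlapping) occurrences in x.
    return [sum(x[i:i + len(g)] == g for i in range(len(x) - len(g) + 1))
            for g in ngrams]

assert ngram_rewards("ABABC") == [2, 1]   # matches the example above
```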
We consider two choices of n-grams: (a) Unigrams: the number of occurrences of a set of unigrams induces conflicting objectives, since we cannot increase the number of occurrences of one unigram without replacing another in a string of a given length; (b) Bigrams: given common characters within the bigrams, the occurrences of multiple bigrams can be increased simultaneously within a string of a fixed length. We also consider different sizes for the set of n-grams, i.e. different numbers of objectives. This allows us to evaluate the behaviour of MOGFN-PC on a variety of objective spaces. We summarize the specific objectives used in our experiments in Table 5. We normalize the rewards to [0, 1] in our experiments.

Model Details and Hyperparameters We build upon the implementation from Stanton et al. (2022) for this task: https://github.com/samuelstanton/lambo. For the string generation task, the backward policy PB is trivial (as there is only one parent for each node s ∈ S), so we only have to parameterize PF and log Z. As PF(·|s, ω) is a conditional policy, we use a conditional Transformer encoder as the architecture. This consists of a Transformer encoder (Vaswani et al., 2017) with 3 hidden layers of dimension 64 and 8 attention heads to embed the current state (the string generated so far) s. We have an MLP which embeds the preferences ω, which are encoded using a thermometer encoding with 50 bins. The embeddings of the state and the preferences are concatenated and passed to a final MLP which produces a categorical distribution over the actions (vocabulary tokens); a minimal code sketch of this conditional policy is given just after the additional results discussion below. We use the same architecture for the baselines with a conditional policy, MOReinforce and MOSoftQL. For Envelope-MOQ, which does not condition on the preferences, we use a standard Transformer encoder with a similar architecture. We present the hyperparameters we used in Table 6. Each method is trained for 10,000 iterations with a minibatch size of 128. For the baselines, we adopt the official implementations released by the authors: MOReinforce (https://github.com/Xi-L/PMOCO) and Envelope-MOQ (https://github.com/RunzheYang/MORL).

Table 5. Objectives considered for the N-grams task

Objectives | n-grams
2 Unigrams | ["A", "C"]
2 Bigrams | ["AC", "CV"]
3 Unigrams | ["A", "C", "V"]
3 Bigrams | ["AC", "CV", "VA"]
4 Unigrams | ["A", "C", "V", "W"]
4 Bigrams | ["AC", "CV", "VA", "AW"]

Table 6. Hyperparameters for the N-grams task

Hyperparameter | Values
Learning Rate (PF) | {0.01, 0.05, 0.001, 0.005, 0.0001}
Learning Rate (Z) | {0.01, 0.05, 0.001}
Reward Exponent: β | {16, 32, 48}
Uniform Policy Mix: δ | {0.01, 0.05, 0.1}

Additional Results We present some additional results for the N-grams task. First, Table 7 summarizes the numerical results for the experiments in Section 5.1.2. Further, we consider different numbers of objectives, d ∈ {2, 4}, in Table 8 and Table 9 respectively. As with the experiments in Section 5.1.2, we observe that MOGFN-PC outperforms the baselines in Pareto performance while achieving high diversity scores. In Table 10, we consider the case of shorter sequences, L = 24. MOGFN-PC continues to provide significant improvements over the baselines. There are two trends we can observe when considering the N-grams task holistically:
1. As the sequence size increases, the advantage of MOGFN-PC becomes more significant.
2. The advantage of MOGFN-PC increases with the number of objectives.
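As a concrete illustration of the conditional policy described in the Model Details above (a Transformer state encoder plus an MLP over thermometer-encoded preferences, concatenated and mapped to action logits), here is a minimal PyTorch sketch. Module names, the absence of positional encodings, and other details are simplifications and may differ from the released code.

```python
import torch
import torch.nn as nn

def thermometer(omega, n_bins=50):
    # Thermometer encoding of preference weights in [0, 1]: bin i is "on" if
    # the weight exceeds the i-th threshold. omega: (batch, n_objectives).
    thresholds = torch.linspace(0, 1, n_bins, device=omega.device)
    return (omega.unsqueeze(-1) > thresholds).float().flatten(1)

class PreferenceConditionalPolicy(nn.Module):
    def __init__(self, vocab_size=21, d_model=64, n_layers=3, n_heads=8,
                 n_objectives=3, n_bins=50):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # embeds the state s
        self.pref_mlp = nn.Sequential(                          # embeds omega
            nn.Linear(n_objectives * n_bins, d_model), nn.LeakyReLU(),
            nn.Linear(d_model, d_model))
        self.head = nn.Sequential(                              # P_F(.|s, omega) logits
            nn.Linear(2 * d_model, d_model), nn.LeakyReLU(),
            nn.Linear(d_model, vocab_size))

    def forward(self, tokens, omega):
        h = self.encoder(self.tok_emb(tokens))[:, -1]           # summary of the prefix
        p = self.pref_mlp(thermometer(omega))
        return self.head(torch.cat([h, p], dim=-1))

policy = PreferenceConditionalPolicy()
logits = policy(torch.randint(0, 21, (4, 10)), torch.full((4, 3), 1.0 / 3))
```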
Table 7. N-grams task: diversity and Pareto performance of various algorithms for the 3 Bigrams and 3 Unigrams tasks, with MOGFN-PC achieving superior Pareto performance.

3 Bigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.05 ± 0.04 | 0 ± 0 | 0.012 ± 0.013 | 19.66 ± 0.66
MOReinforce | 0.12 ± 0.02 | 0 ± 0 | 0.015 ± 0.021 | 20.32 ± 0.93
MOSoftQL | 0.28 ± 0.03 | 21.09 ± 0.65 | 0.093 ± 0.025 | 15.79 ± 0.23
MOGFN-PC | 0.44 ± 0.01 | 19.79 ± 0.08 | 0.220 ± 0.017 | 9.97 ± 0.45

3 Unigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.08 ± 0.015 | 0 ± 0 | 0.023 ± 0.011 | 21.18 ± 0.72
MOReinforce | 0.03 ± 0.001 | 0 ± 0 | 0.036 ± 0.009 | 21.04 ± 0.51
MOSoftQL | 0.36 ± 0.01 | 23.131 ± 0.6736 | 0.105 ± 0.014 | 12.80 ± 0.26
MOGFN-PC | 0.38 ± 0.00 | 22.71 ± 0.24 | 0.121 ± 0.015 | 11.39 ± 0.17

Table 8. N-grams task, 2 objectives

2 Bigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.05 ± 0.001 | 0 ± 0 | 0.0 ± 0.0 | 7.74 ± 0.42
MOReinforce | 0.12 ± 0.01 | 0 ± 0 | 0.151 ± 0.023 | 0.031 ± 0.43
MOSoftQL | 0.37 ± 0.03 | 19.40 ± 0.91 | 0.247 ± 0.031 | 2.92 ± 0.39
MOGFN-TB | 0.51 ± 0.04 | 20.65 ± 0.58 | 0.321 ± 0.011 | 2.31 ± 0.04

2 Unigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.09 ± 0.02 | 0 ± 0 | 0.014 ± 0.001 | 5.73 ± 0.09
MOReinforce | 0.04 | 0 ± 0 | 0.222 ± 0.013 | 2.54 ± 0.06
MOSoftQL | 0.46 ± 0.02 | 22.05 ± 0.04 | 0.253 ± 0.003 | 2.54 ± 0.02
MOGFN-TB | 0.48 ± 0.01 | 22.15 ± 0.22 | 0.267 ± 0.007 | 2.24 ± 0.03

Table 9. N-grams task, 4 objectives

4 Bigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0 ± 0 | 0 ± 0 | 0 ± 0 | 85.23 ± 2.78
MOReinforce | 0.01 ± 0.00 | 0 ± 0 | 0.001 ± 0.001 | 60.42 ± 1.52
MOSoftQL | 0.12 ± 0.04 | 24.32 ± 1.21 | 0.013 ± 0.001 | 39.31 ± 1.35
MOGFN-TB | 0.23 ± 0.02 | 20.31 ± 0.43 | 0.055 ± 0.017 | 24.42 ± 1.44

4 Unigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0 ± 0 | 0 ± 0 | 0 ± 0 | 80.36 ± 3.16
MOReinforce | 0.00 ± 0.00 | 0 ± 0 | 0 ± 0 | 79.12 ± 4.21
MOSoftQL | 0.22 ± 0.02 | 24.18 ± 1.43 | 0.019 ± 0.005 | 31.46 ± 2.32
MOGFN-TB | 0.33 ± 0.01 | 23.24 ± 0.23 | 0.063 ± 0.032 | 23.31 ± 2.03

Table 10. N-grams task, shorter sequences

3 Bigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.07 ± 0.01 | 0 ± 0 | 0.027 ± 0.010 | 16.21 ± 0.48
MOReinforce | 0.18 ± 0.01 | 0 ± 0 | 0.053 ± 0.031 | 13.35 ± 0.82
MOSoftQL | 0.31 ± 0.02 | 20.12 ± 0.51 | 0.143 ± 0.019 | 12.79 ± 0.41
MOGFN-PC | 0.45 ± 0.02 | 19.62 ± 0.04 | 0.225 ± 0.009 | 9.82 ± 0.23

3 Unigrams:
Algorithm | Reward (↑) | Diversity (↑) | HV (↑) | R2 (↓)
Envelope-MOQ | 0.08 ± 0.02 | 0 ± 0 | 0.031 ± 0.015 | 20.13 ± 0.41
MOReinforce | 0.07 ± 0.02 | 0 ± 0 | 0.041 ± 0.009 | 19.25 ± 0.41
MOSoftQL | 0.38 ± 0.02 | 21.13 ± 0.35 | 0.109 ± 0.011 | 12.12 ± 0.24
MOGFN-PC | 0.39 ± 0.01 | 21.94 ± 0.21 | 0.125 ± 0.015 | 10.91 ± 0.14

D.3. QM9

Reward Details As mentioned in Section 5.2.1, we consider four reward functions for our experiments. The first reward function is the HOMO-LUMO gap, for which we rely on the predictions of a pretrained MXMNet (Zhang et al., 2020) model trained on the QM9 dataset (Ramakrishnan et al., 2014). The second reward is the standard Synthetic Accessibility (SA) score, which we calculate using the RDKit library (Landrum); to get the reward we compute (10 − SA)/9. The third reward function is a molecular weight target: we first calculate the molecular weight of a molecule using RDKit, and then construct a reward function of the form exp(−(molWt − 105)^2/150), which is maximized at a molecular weight of 105. Our final reward function is a logP target, exp(−(logP − 2.5)^2/2), which is again calculated with RDKit and is maximized at 2.5.
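For reference, a small sketch of the three reward transforms that can be computed directly with RDKit (the HOMO-LUMO reward requires the pretrained MXMNet model and is omitted). The raw SA score is assumed to come from RDKit's contrib sascorer module; the function names here are illustrative.

```python
import math
from rdkit import Chem
from rdkit.Chem import Descriptors

def sa_reward(sa_score):
    # SA scores lie roughly in [1, 10] (lower is easier to synthesize);
    # (10 - SA) / 9 maps them to a reward in [0, 1].
    return (10 - sa_score) / 9

def molwt_target_reward(mol):
    # exp(-(molWt - 105)^2 / 150), maximized at a molecular weight of 105.
    return math.exp(-((Descriptors.MolWt(mol) - 105) ** 2) / 150)

def logp_target_reward(mol):
    # exp(-(logP - 2.5)^2 / 2), maximized at logP = 2.5.
    return math.exp(-((Descriptors.MolLogP(mol) - 2.5) ** 2) / 2)

mol = Chem.MolFromSmiles("CCO")   # toy molecule
print(molwt_target_reward(mol), logp_target_reward(mol))
```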
Model Details and Hyperparameters We sample new preferences for every episode from a Dirichlet(α), and encode the desired sampling temperature using a thermometer encoding (Buckman et al., 2018). We use a graph neural network based on a graph transformer architecture (Yun et al., 2019). We transform this conditional encoding to an embedding using an MLP. The embedding is then fed to the GNN as a virtual node, as well as concatenated with the node embeddings in the graph. The model's action space is to add a new node to the graph, add a new bond, or set node or bond properties (such as making a bond a double bond). It also has a stop action. For more details, please refer to the code provided in the supplementary material. We summarize the hyperparameters used in Table 11.

Table 11. Hyperparameters for the QM9 task

Hyperparameter | Value
Learning Rate (PF) | 0.0005
Learning Rate (Z) | 0.0005
Reward Exponent: β | 32
Batch Size | 64
Number of Embeddings | 64
Uniform Policy Mix: δ | 0.001
Number of layers | 4

D.4. Fragments

More Details As mentioned in Section 5.2.2, we consider four reward functions for our experiments. The first reward function is a proxy trained on molecules docked with AutoDock Vina (Trott & Olson, 2010) for the sEH target; we use the weights provided by Bengio et al. (2021a). We also use synthetic accessibility, as for QM9, and a molecular weight target region (instead of the specific target weight used for QM9), ((300 − molwt) / 700 + 1).clip(0, 1), which favors molecules with a weight under 300. Our final reward function is QED, which is again calculated with RDKit.

Model Details and Hyperparameters We again use a graph neural network based on a graph transformer architecture (Yun et al., 2019). The experimental protocol is similar to the QM9 experiments discussed in Appendix D.3. We additionally sample trajectories from a lagged model whose parameters θ′ are updated as θ′ ← τθ′ + (1 − τ)θ, where θ are the parameters of the trained model (a short sketch of this update is given at the end of this subsection). The model's action space is to add a new node by choosing from a list of fragments and an attachment point on the current molecular graph. We list all hyperparameters used in Table 12.

Table 12. Hyperparameters for the Fragments task

Hyperparameter | Value
Learning Rate (PF) | 0.0005
Learning Rate (Z) | 0.0005
Reward Exponent: β | 96
Batch Size | 256
Sampling model τ | 0.95
Number of Embeddings | 128
Number of layers | 6

Additional Results We also present in Figure 8 a view of the reward distribution produced by MOGFN-PC. Generally, the model is able to find good near-Pareto-optimal samples, but it is also able to spend a lot of time exploring. The figure also shows that the model is able to respect the preference conditioning, and remains capable of generating a diverse distribution rather than a single point.

Figure 8. Fragment-based molecule generation: see Appendix D.4.

In the off-diagonal plots of Figure 8, we show pairwise scatter plots for each objective pair; the Pareto front is depicted with a red line; each point corresponds to a molecule generated by the model as it explores the state space; color indicates density (linear viridis palette). The diagonal plots show two overlaid pieces of information: a blue histogram for each objective, and an orange scatter plot showing the relationship between the preference conditioning and the generated molecules. The effect of this conditioning is particularly visible for seh (top left) and wt (bottom right). As the preference for the sEH binding reward gets closer to 1, the generated molecules' reward for sEH gets closer to 1 as well. Indeed, the expected shape for such a scatter plot is triangular-ish: when the preference ωi for reward Ri is close to 1, the model is expected to generate objects with a high reward for Ri; as ωi gets further away from 1, the model can generate anything, including objects with a high Ri, unless there is a trade-off between objectives, in which case it cannot. This is the case for the seh objective, but not for the wt objective, which has a more triangular shape.
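As referenced in the Fragments model details above, here is a short sketch of the lagged (exponential moving average) sampling model update, assuming PyTorch modules; the names are illustrative.

```python
import copy
import torch

@torch.no_grad()
def update_lagged_model(lagged, model, tau=0.95):
    # theta' <- tau * theta' + (1 - tau) * theta
    for p_lag, p in zip(lagged.parameters(), model.parameters()):
        p_lag.mul_(tau).add_(p, alpha=1 - tau)

# usage: lagged = copy.deepcopy(policy); after each gradient step on `policy`,
# call update_lagged_model(lagged, policy, tau=0.95) and sample trajectories from `lagged`.
```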
D.5. DNA Sequence Design

Task Details The set of building blocks here consists of the bases ["A", "C", "T", "G"], in addition to a special end-of-sequence token. To compute the free energy and the number of base pairs with the software NUPACK (Zadeh et al., 2011), we used 310 K as the temperature. The inverse-length objective was calculated as 30/L, as 30 was the minimum length for sampled sequences. The rewards are normalized to [0, 1] for our experiments.

Model Details and Hyperparameters We use the same implementation as for the N-grams task, detailed in Appendix D.2. Here we instead consider a 4-layer Transformer architecture, with 256 units per layer and 16 attention heads. We detail the most relevant hyperparameters in Table 13.

Table 13. Hyperparameters tuned for the DNA aptamers task

Hyperparameter | Values
Learning Rate (PF) | {0.001, 0.0001, 0.00001, 0.000001}
Learning Rate (Z) | 0.001
Reward Exponent: β | {40, 60, 80}
Batch Size | 16
Training iterations | 10,000
Dirichlet α | {0.1, 1.0, 1.5}

Discussion of Results Contrary to the other tasks on which we evaluated MOGFN-PC, for the generation of DNA aptamer sequences our proposed model did not match the best baseline, multi-objective reinforcement learning (Lin et al., 2021), in terms of Pareto performance. Nonetheless, it is worth delving into the details in order to better understand the different solutions found by the two methods. First, as indicated in Section 5, despite the better Pareto performance, the best sequences generated by the RL method have extremely low diversity (0.62), compared to MOGFN, which generates optimal sequences with a diversity of 19.6 or higher. As a matter of fact, MOReinforce mostly samples sequences with the well-known pattern GCGC... for all possible lengths. Sequences with this pattern indeed have low (negative) energy and a large number of base pairs, but they offer little new insight and poor diversity if the model is not able to generate sequences with other distinct patterns. On the contrary, GFlow Nets are able to generate sequences with patterns other than repetitions of the G-C pair. Interestingly, we observed that GFlow Nets were able to generate sequences with even lower energy than the best sequences generated by MOReinforce, by inserting bases A and T into chains of GCGC.... Finally, we observed that one reason why MOGFN does not match the Pareto performance of MOReinforce is that for short lengths (one of the objectives) the energy and number of pairs are not successfully optimised. Nonetheless, the optimisation of energy and number of pairs is very good for the longest sequences. Given these observations, we conjecture that there is room for improving the set of hyperparameters or certain aspects of the algorithm.

Additional Results In order to better understand the impact of the main hyperparameters of MOGFN-PC on the Pareto performance and diversity of the optimal candidates, we train multiple instances by sweeping over several values of the hyperparameters, as indicated in Table 13. We present the results in Table 14. One key observation is that there seems to be a trade-off between the Pareto performance and the diversity of the Top-K sequences. Nonetheless, even the models with the lowest diversity are able to generate much more diverse sequences than MOReinforce.
Furthermore, we also observe that α < 1 as the parameter of the Dirichlet distribution used to sample the weight preferences, as well as a higher β (reward exponent), both yield better Pareto performance metrics but slightly worse diversity. In the case of β, this observation is consistent with the analysis in the Bigrams task (Figure 3), but with Bigrams the best performance was obtained with α = 1. This is indicative of a degree of dependence on the task and the nature of the objectives.

Table 14. Analysis of the impact of α, β and the learning rate on the performance of MOGFN-PC for DNA sequence design. We observe a trade-off between the Top-K diversity and the Pareto performance.

Effect of p(ω):
Metric | Dir(α = 0.1) | Dir(α = 1) | Dir(α = 1.5)
Reward (↑) | 0.687 ± 0.01 | 0.652 ± 0.01 | 0.639 ± 0.01
Diversity (↑) | 17.65 ± 0.37 | 19.58 ± 0.15 | 20.18 ± 0.58
Hypervolume (↑) | 0.506 ± 0.01 | 0.467 ± 0.02 | 0.440 ± 0.01
R2 (↓) | 2.462 ± 0.05 | 2.576 ± 0.08 | 2.688 ± 0.02

Effect of β:
Metric | 40 | 60 | 80
Reward (↑) | 0.506 ± 0.01 | 0.560 ± 0.01 | 0.652 ± 0.01
Diversity (↑) | 28.49 ± 0.32 | 24.93 ± 0.19 | 19.58 ± 0.15
Hypervolume (↑) | 0.277 ± 0.03 | 0.363 ± 0.03 | 0.467 ± 0.02
R2 (↓) | 4.225 ± 0.34 | 2.905 ± 0.18 | 2.576 ± 0.08

Effect of the learning rate:
Metric | 1e-5 | 1e-4 | 1e-3 | 1e-2
Reward (↑) | 0.587 ± 0.01 | 0.652 ± 0.01 | 0.654 ± 0.03 | 0.604 ± 0.01
Diversity (↑) | 21.92 ± 0.59 | 19.58 ± 0.15 | 19.51 ± 1.14 | 23.16 ± 0.18
Hypervolume (↑) | 0.333 ± 0.01 | 0.467 ± 0.02 | 0.496 ± 0.01 | 0.336 ± 0.01
R2 (↓) | 3.855 ± 0.31 | 2.576 ± 0.01 | 2.488 ± 0.03 | 3.422 ± 0.07

D.6. Active Learning

Task Details We consider the Proxy RFP task from Stanton et al. (2022), an in silico benchmark task designed to simulate the search for improved red fluorescent protein (RFP) variants (Dance et al., 2021). The objectives considered are stability (−ΔG, the negative change in Gibbs free energy) and solvent-accessible surface area (SASA) (Shrake & Rupley, 1973) in simulation, computed using the FoldX suite (Schymkowitz et al., 2005) and BioPython (Cock et al., 2009). We use the dataset introduced in Stanton et al. (2022) as the initial pool of candidates D0, with |D0| = 512.

Method Details and Hyperparameters Our implementation builds upon the publicly released code from Stanton et al. (2022): https://github.com/samuelstanton/lambo. We follow the exact experimental setup used in Stanton et al. (2022). The surrogate model f̂ consists of an encoder with 1D convolutions (masking positions corresponding to padding tokens). We use 3 standard pre-activation residual blocks with two convolution layers, layer norm and swish activations, with a kernel size of 5, 64 intermediate channels and 16 latent channels. A multi-task GP with an ICM kernel is defined in the latent space of this encoder, and outputs the predictions for each objective. We also use the training tricks detailed in Stanton et al. (2022) for the surrogate model. The hyperparameters, taken from Stanton et al. (2022), are shown in Table 15. The acquisition function used is NEHVI (Daulton et al., 2021), defined as

α({x_j}_{j=1}^{i}) = (1/N) Σ_{t=1}^{N} HVI({f̃_t(x_j)}_{j=1}^{i−1} | P_t) + (1/N) Σ_{t=1}^{N} HVI(f̃_t(x_i) | P_t ∪ {f̃_t(x_j)}_{j=1}^{i−1}),   (3)

where f̃_t, t = 1, . . . , N are independent draws from the surrogate model (which defines a posterior over functions), and P_t denotes the Pareto frontier in the current dataset D under f̃_t.
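Since this task has two objectives, the hypervolume-improvement terms entering Equation (3) can be illustrated with a small exact 2-objective sketch. The `posterior_draws`, `pareto_fronts` and reference point below are hypothetical stand-ins; this is a simplified illustration of the estimator's structure, not the implementation used in the experiments.

```python
import numpy as np

def hypervolume_2d(points, ref):
    # Exact hypervolume dominated by `points` w.r.t. reference point `ref`,
    # for two maximized objectives.
    pts = np.atleast_2d(np.asarray(points, dtype=float))
    ref = np.asarray(ref, dtype=float)
    if pts.size == 0:
        return 0.0
    pts = pts[(pts > ref).all(axis=1)]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]            # sort by first objective, descending
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:                            # non-dominated sweep
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

def hvi_2d(new_points, front, ref):
    # Hypervolume improvement of `new_points` over the current front.
    return hypervolume_2d(list(front) + list(new_points), ref) - hypervolume_2d(front, ref)

def nehvi(x, posterior_draws, pareto_fronts, ref):
    # Monte Carlo estimate in the spirit of Eq. (3): average, over posterior draws
    # f_t, of the hypervolume improvement of f_t(x) over that draw's front P_t.
    # (For a batch, the sampled objectives of previously selected candidates would
    # also be added to each P_t; omitted here for brevity.)
    return float(np.mean([hvi_2d([f_t(x)], P_t, ref)
                          for f_t, P_t in zip(posterior_draws, pareto_fronts)]))

# sanity check: adding (2, 2) to the front {(3, 1), (1, 3)} gains the 1x1 corner
print(hvi_2d([(2.0, 2.0)], [(3.0, 1.0), (1.0, 3.0)], ref=(0.0, 0.0)))   # -> 1.0
```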
Table 15. Hyperparameters for training the surrogate model f̂

Hyperparameter | Value
Shared enc. depth (# residual blocks) | 3
Disc. enc. depth (# residual blocks) | 1
Decoder depth (# residual blocks) | 3
Conv. kernel width (# tokens) | 5
# conv. channels | 64
Latent dimension | 16
GP likelihood variance init | 0.25
GP lengthscale prior | N(0.7, 0.01)
# inducing points (SVGP head) | 64
DAE corruption ratio (training) | 0.125
DAE learning rate (MTGP head) | 5.00E-03
DAE learning rate (SVGP head) | 1.00E-03
DAE weight decay | 1.00E-04
Adam EMA params | 0., 1e-2
Early stopping holdout ratio | 0.1
Early stopping relative tolerance | 1.00E-03
Early stopping patience (# epochs) | 32
Max # training epochs | 256

We replace the LaMBO candidate generation with GFlow Nets. We generate a set of mutations m = {(li, vi)} for a sequence x from the current approximation of the Pareto front P̂i. Note that, as opposed to the sequence generation experiments, PB here is not trivial, as there are multiple ways (orders) of generating the set. For our experiments, we use a uniform random PB. PF takes as input the sequence x with the mutations generated so far applied. We use a Transformer encoder with 3 layers, hidden dimension 64 and 8 attention heads as the architecture for the policy. The policy outputs a distribution over the locations in x, {1, . . . , |x|}, and a distribution over tokens for each location. The vocabulary of actions here is the same as in the N-grams task: ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]. The logits of the locations of the mutations generated so far are set to -1000, to prevent generating the same sequence (see the short masking sketch after Table 16). The acquisition function (NEHVI) value for the mutated sequence is used as the reward. We also use a reward exponent β. To make optimization easier (as the acquisition function becomes harder to optimize with growing β), we reduce β linearly by a factor of δβ at each round. We train the GFlow Net for 750 iterations in each round. The active learning batch size is 16, and we run 64 rounds of optimization. Table 16 presents the hyperparameters used for MOGFN-AL.

Table 16. Hyperparameters for MOGFN-AL

Hyperparameter | Values
Learning Rate (PF) | {0.01, 0.001, 0.0001}
Learning Rate (Z) | {0.01, 0.001}
Reward Exponent: β | {16, 24}
Uniform Policy Mix: δ | {0.01, 0.05}
Maximum number of mutations | {10, 15, 20}
δβ | {0.5, 1, 2}
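Finally, a tiny sketch of the location masking referenced above when generating mutation sets; the shapes and names are illustrative.

```python
import torch

def masked_location_logits(loc_logits, mutated_positions):
    # Set the logits of already-mutated positions to -1000 so the same
    # location is not selected twice within one mutation set.
    masked = loc_logits.clone()
    masked[mutated_positions] = -1000.0
    return masked

loc_logits = torch.randn(36)   # one logit per position in x
loc = torch.distributions.Categorical(
    logits=masked_location_logits(loc_logits, [3, 17])).sample()
```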