Published as a conference paper at ICLR 2022

MULTI-OBJECTIVE OPTIMIZATION BY LEARNING SPACE PARTITIONS

Yiyang Zhao (Worcester Polytechnic Institute), Linnan Wang (Brown University), Kevin Yang (UC Berkeley), Tianjun Zhang (UC Berkeley), Tian Guo (Worcester Polytechnic Institute), Yuandong Tian (Facebook AI Research)

ABSTRACT

In contrast to single-objective optimization (SOO), multi-objective optimization (MOO) requires an optimizer to find the Pareto frontier, the subset of feasible solutions that are not dominated by other feasible solutions. In this paper, we propose LaMOO, a novel multi-objective optimizer that learns a model from observed samples to partition the search space and then focuses on promising regions that are likely to contain a subset of the Pareto frontier. The partitioning is based on the dominance number, which measures how close a data point is to the Pareto frontier among existing samples. To account for possible partition errors due to limited samples and model mismatch, we leverage Monte Carlo Tree Search (MCTS) to exploit promising regions while exploring suboptimal regions that may turn out to contain good solutions later. Theoretically, we prove the efficacy of learning space partitions via LaMOO under certain assumptions. Empirically, measured by hypervolume (HV), a popular MOO metric, LaMOO substantially outperforms strong baselines on multiple real-world MOO tasks, with up to 225% greater sample efficiency for neural architecture search on Nasbench201, and up to 10% for molecular design.

1 INTRODUCTION

Multi-objective optimization (MOO) has been extensively used in many practical scenarios involving trade-offs between multiple objectives. For example, in automobile design (Chang, 2015), we must maximize the performance of the engine while simultaneously minimizing emissions and fuel consumption. In finance (Gunantara, 2018), one prefers a portfolio that maximizes the expected return while minimizing risk. Mathematically, in MOO we optimize $M$ objectives $f(x) = [f_1(x), f_2(x), \ldots, f_M(x)] \in \mathbb{R}^M$:

$$\min \; f_1(x), f_2(x), \ldots, f_M(x) \quad \text{s.t. } x \in \Omega \tag{1}$$

While we could set arbitrary weights for each objective to turn it into a single-objective optimization (SOO) problem, modern MOO methods aim to find the problem's entire Pareto frontier: the set of solutions that are not dominated by any other feasible solutions¹ (see Fig. 1 for an illustration). The Pareto frontier yields a global picture of optimal solution structures rather than focusing on one specific weighted combination of objectives.

¹Here we define dominance $y \prec_f x$ as $f_i(x) \le f_i(y)$ for all objectives $f_i$, with $f_i(x) < f_i(y)$ for at least one $i$, $1 \le i \le M$. That is, solution $x$ is always better than solution $y$, regardless of how the $M$ objectives are weighted.

As a result, MOO is fundamentally different from SOO. Instead of focusing on a single optimal solution, a strong MOO optimizer should cover the search space broadly to explore the Pareto frontier. Popular quality indicators in MOO, such as hypervolume (HV), capture this aspect by computing the volume of the currently estimated frontier. Specifically, given a reference point $R \in \mathbb{R}^M$, as shown in Fig. 1(a), the hypervolume of a finite approximate Pareto set $P$ is the $M$-dimensional Lebesgue measure $\lambda_M$ of the space dominated by $P$ and bounded from below by $R$. That is, $HV(P, R) = \lambda_M\big(\bigcup_{i=1}^{|P|} [R, y_i]\big)$, where $[R, y_i]$ denotes the hyper-rectangle bounded by the reference point $R$ and $y_i$. Consequently, the optimizer must consider the diversity of solutions in addition to their optimality.
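To make the dominance and hypervolume definitions concrete, below is a minimal sketch (our own illustration, assuming NumPy; not code from the paper) that checks the dominance relation from the footnote and computes the hypervolume of a two-objective Pareto set under minimization by sweeping the points in sorted order.

```python
import numpy as np

def dominates(fx, fy):
    """True iff solution x dominates solution y under minimization:
    f_i(x) <= f_i(y) for all objectives and f_i(x) < f_i(y) for at least one."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def hypervolume_2d(front, ref):
    """HV(P, R) for M = 2: the area dominated by the (mutually non-dominated)
    points in `front` and bounded by the reference point `ref`."""
    rx, ry = ref
    hv, prev_y = 0.0, ry
    for f1, f2 in sorted(front):              # sweep by the first objective
        hv += max(rx - f1, 0.0) * max(prev_y - f2, 0.0)
        prev_y = min(prev_y, f2)
    return hv

print(dominates([1.0, 2.0], [2.0, 2.0]))                 # True
print(hypervolume_2d([(1, 3), (2, 2), (3, 1)], (4, 4)))  # 6.0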
Figure 1: Left: A basic setting in multi-objective optimization (MOO), optimizing M = 2 objectives in Eqn. 1. (a) depicts the objective space (f1, f2) and (b) shows the search space x ∈ Ω. In (a), P denotes the Pareto frontier, R is the reference point, the hypervolume HV is the shaded area, and o(x) are the dominance numbers. In (b), once a few samples are collected within Ω, LaMOO learns to partition the search space Ω into sub-regions (i.e., Ω_good and Ω_bad) according to the dominance numbers in objective space, and then focuses future sampling on the good regions that are close to the Pareto frontier. This procedure can be repeated to further partition Ω_good and Ω_bad. Right: [table summarizing the properties of the MOO methods used in our experiments — the evolutionary methods MOEA/D (Zhang & Li, 2007), CMA-ES (Igel et al., 2007a), NSGA-II (Deb et al., 2002a), and NSGA-III (Deb & Jain, 2014); the Bayesian optimization methods ParEGO (Knowles, 2006) and qEHVI (Daulton et al., 2020); and LaMOO (our approach), which uses space partitioning — together with whether each handles more than 3 objectives.]

While several previous works have proposed approaches to capture this diversity-optimality trade-off (Deb et al., 2002a; Knowles, 2006; Igel et al., 2007; Deb & Jain, 2014; Daulton et al., 2020), in this paper we take a fundamentally different route by learning promising candidate regions from previously explored samples. Ideally, to find the Pareto frontier in as few function evaluations as possible, we want to sample heavily in the Pareto optimal set Ω_P, defined as the region of input vectors that corresponds to the Pareto frontier. One way to focus samples on Ω_P is to gradually narrow the full search space down to the subregion containing Ω_P via partitioning. For example, in the case of quadratic objective functions, Ω_P can be separated from the non-optimal set Ω \ Ω_P via simple linear classifiers (see Observations 1 and 2).

Motivated by these observations, we design LaMOO, a novel MOO meta-optimizer that progressively partitions regions into sub-regions and then focuses on the sub-regions likely to contain Pareto-optimal points, where existing solvers can help. LaMOO is therefore a meta-algorithm. Unlike cutting-plane methods (Loganathan & Sherali, 1987; Hinder, 2018; Vieira & Lisboa, 2019), which leverage the (sub-)gradient of convex objectives as the cutting plane and enjoy global optimality guarantees, LaMOO is data-driven: it leverages previous samples to build classifiers that learn the partition, and focuses future samples in the resulting promising regions. No analytical formula for the objectives or their sub-gradients is needed. LaMOO is a multi-objective extension of recent works (Wang et al., 2020; Yang et al., 2021) that also learn space partitions, but for a single black-box objective. Empirically, LaMOO outperforms existing approaches on many benchmarks, including standard benchmarks in multi-objective black-box optimization and real-world multi-objective problems like neural architecture search (NAS) (Cai et al., 2019; 2020) and molecule design.
For example, as a meta-algorithm, LaMOO combined with CMA-ES as an inner routine requires only 62.5%, 8%, and 29% as many samples to reach the same hypervolume as the original CMA-ES (Igel et al., 2007a) on Branin-Currin (Belakaria et al., 2019), Vehicle Safety (Liao et al., 2008), and Nasbench201 (Dong & Yang, 2020), respectively. On average, compared to qEHVI, LaMOO uses 50% of the samples to achieve the same performance on these problems. In addition, LaMOO with qEHVI (Daulton et al., 2020) and with CMA-ES requires 71% and 31% fewer samples on average, compared to vanilla qEHVI and CMA-ES, to achieve the same performance in molecule discovery.

2 RELATED WORK

Bayesian Optimization (BO) (Zitzler et al., 2003; Knowles, 2006; Ponweiser et al., 2008; Couckuyt et al., 2014; Paria et al., 2018; Yang et al., 2019; Daulton et al., 2020) is a popular family of methods for optimizing black-box single and multiple objectives. Using observed samples, BO learns a surrogate model f̂(x), searches for new promising candidates via an acquisition function built on f̂(x), and queries the quality of these candidates with the ground-truth black-box objective(s). In multi-objective Bayesian optimization (MOBO), most approaches leverage Expected Hypervolume Improvement (EHVI) as the acquisition function (Zitzler et al., 2003; Couckuyt et al., 2014; Yang et al., 2019), since finding the Pareto frontier is equivalent to maximizing the hypervolume given a finite search space (Fleischer, 2003). Other methods (Knowles, 2006; Ponweiser et al., 2008; Paria et al., 2018) use different acquisition functions such as expected improvement (Jones et al., 1998) and Thompson sampling (Thompson, 1933). EHVI is computationally expensive: its cost increases exponentially with the number of objectives. To address this problem, qEHVI (Daulton et al., 2020) accelerates optimization by computing EHVI in parallel, and has become the state-of-the-art MOBO algorithm. In this paper, we leverage qEHVI as a candidate inner solver for our proposed LaMOO algorithm.

Evolutionary algorithms (EAs) (Deb et al., 2002a; Igel et al., 2007a; Zhang & Li, 2007; Beume et al., 2007; Fang et al., 2018) are also popular for MOO tasks. One category of MOO EAs (Srinivas & Deb, 1994; Deb et al., 2002a; Deb & Jain, 2014) leverages Pareto dominance to simultaneously optimize all objectives. A second category (e.g., (Zhang & Li, 2007)) decomposes a multi-objective optimization problem into a number of single-objective sub-problems, converting a difficult MOO into several SOOs. Another category consists of quality-indicator-based methods, such as (Beume et al., 2007) and (Igel et al., 2007a), which scalarize the current Pareto frontier using quality indicators (e.g., HV), again transforming a MOO into a SOO. New samples are generated by crossover and mutation operations on existing ones. However, the drawbacks of non-quality-indicator-based methods (i.e., the first two categories) cannot be overlooked. Specifically, for MOO with many objectives, NSGA-II (Deb et al., 2002a) easily gets stuck at dominance-resistant solutions (Pang et al., 2020) that are far from the true Pareto frontier, while MOEA/D performs better but faces the main challenge of specifying the weight vectors for problems with an unknown Pareto frontier (Deb & Jain, 2014). In addition, A*-search-based algorithms have also been extended to MOO (Stewart & White, 1991; Tung Tung & Lin Chew, 1992; De la Cruz et al., 2005).
Quality Indicators. Besides hypervolume, there are several other quality indicators (Van Veldhuizen & Lamont, 1998; Zitzler et al., 2000; Bosman & Thierens, 2003) for evaluating sample quality, which can be used to scalarize a MOO into a SOO. The performance of a quality indicator can be evaluated along three metrics (Deng et al., 2007; Li et al., 2014): convergence (closeness to the Pareto frontier), uniformity (the extent to which the samples follow a uniform distribution), and spread (the extent of the obtained approximate Pareto frontier). Sec. B illustrates the merits of each quality indicator. Hypervolume is the only metric we explored that simultaneously evaluates convergence, uniformity, and spread without knowledge of the true Pareto frontier, although its computation can be expensive in many-objective problems. Therefore, throughout this work we use HV to evaluate the optimization performance of the different algorithms.

3 LEARNING SPACE PARTITIONS: A THEORETICAL UNDERSTANDING

Searching a high-dimensional space for the optimal solution of a function is in general a challenging problem, especially when the function's properties are unknown to the search algorithm. The difficulty is mainly due to the curse of dimensionality: to adequately cover a d-dimensional space, in general an exponential number of samples is needed. For this reason, many works use a coarse-to-fine approach: partition the search space and then focus on promising regions. Traditionally, manually defined criteria are used, e.g., axis-aligned partitions (Munos, 2011b), Voronoi diagrams (Kim et al., 2020), etc. Recently, (Wang et al., 2019; 2020; Yang et al., 2021) learn space partitions based on the data collected thus far, and show strong performance in the NeurIPS black-box optimization challenges (Sazanovich et al.; Kim et al.). On the other hand, there is little quantitative understanding of space partitioning. In this paper, we first give a formal theoretical analysis of why learning plays an important role in space-partition approaches for SOO. Leveraging this understanding of how space partitioning works, we propose LaMOO, which empirically outperforms existing SoTA methods on multiple MOO benchmarks.

3.1 PROBLEM SETTING

Intuitively, learning space partitions yields strong performance if the classifier can determine which regions are promising given few data points. We formalize this intuition below and show why it is better than fixed, manually defined criteria for space partitioning. Consider the following sequential decision task. We have N samples in a discrete subset S0, and there exists one sample x* that achieves the minimal value of a scalar function f. Note that f can encode any property we want, e.g., membership in the Pareto optimal set. The goal is to construct a subset S_T ⊆ S0 after T steps, so that (1) x* ∈ S_T and (2) |S_T| is as small as possible. More formally, we define the reward function r as the probability that we obtain x* by randomly sampling from the resulting subset S_T:

$$r := \frac{1}{|S_T|}\, P(x^* \in S_T) \tag{2}$$

It is clear that 0 ≤ r ≤ 1, and r = 1 means that we have already found the optimal sample x*. Here we use the discrete case for simplicity and leave the continuous case (i.e., partitioning a region Ω0 instead of a discrete set S0) to future work. Note that N can be large, so we consider it infeasible to enumerate S0 to find x*. However, sampling from S0, as well as comparing the quality of sampled solutions, is allowed.
An obvious baseline is to simply set S_T := S0, yielding r_b = N⁻¹. Now the question is: can we do better? Here we seek help from the following oracle:

Definition 1 ((α, η)-Oracle). Given a subset S that contains x*, after taking k samples from S, the oracle can find a subset S_good with |S_good| ≤ |S|/2 and

$$P(x^* \in S_{\text{good}} \mid x^* \in S) \ge 1 - \exp\left(-\frac{k}{\eta |S|^{\alpha}}\right) \tag{3}$$

Lemma 1. The algorithm that uniformly draws k samples from S, picks the better half, and returns it is a (1, 1)-oracle.

See Appendix A for the proof. Note that a (1, 1)-oracle is very weak and of little use in obtaining a higher reward r. We typically hope for an oracle with smaller α and η (i.e., both smaller than 1). Intuitively, such oracles are more sample-efficient: with few samples, they can narrow down the region containing the optimal solution x* with high probability. Note that α < 1 corresponds to semi-parametric models. In these cases, the oracle has a generalization property: with substantially fewer samples than N (i.e., on the order of N^α), the oracle is able to put the optimal solution x* on the right side. In the extreme case α = 0 (parametric models), whether we classify the optimal solution x* on the correct side depends only on the absolute number of samples collected in S, and is independent of its size. For example, if the function to be optimized is linear, then with d + 1 samples we can completely characterize the property of all |S| samples.

Relation to cutting planes. Our setting can be regarded as a data-driven extension of cutting-plane methods (Loganathan & Sherali, 1987; Vieira & Lisboa, 2019; Hinder, 2018) in optimization, in which a cutting plane is found at the current solution to reduce the search space. For example, if f is convex and its gradient ∇f(x) is available, then we can set $S_{\text{good}} := \{x \in S_0 : \nabla f(x_0)^\top (x - x_0) \le 0\}$, since for any x ∈ S0 \ S_good, convexity gives $f(x) \ge f(x_0) + \nabla f(x_0)^\top (x - x_0) > f(x_0)$, and thus x is not better than the current x0. However, the cutting-plane method relies on specific function properties like convexity. In contrast, learning a space partition can leverage knowledge about the function forms, combined with the samples observed so far, to better partition the space.

3.2 REWARDS UNDER OPTIMAL ACTION SEQUENCE

We now consider applying the (α, η)-oracle iteratively for T steps, drawing k_t samples from S_{t-1} and setting S_t := S_good,t-1. We assume a total sample budget K, so $\sum_{t=1}^{T} k_t = K$. Note that T ≤ log₂ N, since we halve the set size with each iteration. Now the question is twofold. (1) How can we determine the action sequence {k_t} in order to maximize the total reward r? (2) Following the optimal action sequence {k*_t}, can r be better than the baseline r_b = N⁻¹? The answer is yes.

Theorem 1. The algorithm yields a reward r lower bounded by the following:

$$r \ge r_b \exp\left[T\left(\log 2 - \frac{\eta N^{\alpha}\,\varphi(\alpha, T)}{K}\right)\right] \tag{4}$$

where $r_b := N^{-1}$ and $\varphi(\alpha, T) := (1 - 2^{-\alpha T})/(1 - 2^{-\alpha})$.

Remarks. Following Theorem 1, a key condition for r > r_b is $\log 2 > \eta N^{\alpha}\varphi(\alpha, T)/K$, which holds when $\eta N^{\alpha}\varphi(\alpha, T)/K \to 0$. Note that since T ≤ log₂ N, the final reward r is upper bounded by 1 (rather than going to +∞). We consider some common practical scenarios below.

Non-parametric models (α = 1). In this case, φ(α, T) ≤ 2 and the condition becomes $\frac{1}{2}\log 2 > \eta N/K$. This happens when the total sample budget K = Θ(N), i.e., on the same order as N, which means the partitioning algorithm obtains little advantage over exhaustive search.

Semi-parametric models (α < 1). In this case, φ(α, T) ≤ 1/(1 − 2^{−α}) and the condition becomes $(1 - 2^{-\alpha})\log 2 > \eta N^{\alpha}/K$. This happens when the total sample budget K = Θ(N^α). In this case, we can use many fewer samples than exhaustive search to achieve a better reward, thanks to the generalization property of the oracle.

Parametric models (α = 0). Now φ(α, T) = T and the condition becomes $\log 2 > \eta T/K$. Since T ≤ log₂ N, the total sample budget can be set to K = Θ(log N). Intuitively, the algorithm performs iterative halving (or binary search) to narrow the search toward promising regions.
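As a quick sanity check on this analysis, the sketch below (our own illustration, not from the paper) runs the (1, 1)-oracle of Lemma 1 for T rounds of halving and estimates the reward by Monte Carlo; even with a small per-step budget, the estimate should clearly beat the baseline r_b = 1/N.

```python
import random

def one_one_oracle(S, k):
    """(1,1)-oracle of Lemma 1: uniformly draw k samples from S, keep the
    better half of the draws, and assign every unsampled point at random."""
    drawn = sorted(random.sample(S, min(k, len(S))))   # smaller value = better
    half = max(1, len(drawn) // 2)
    good, bad = set(drawn[:half]), set(drawn[half:])
    for x in S:
        if x not in good and x not in bad:
            (good if random.random() < 0.5 else bad).add(x)
    return sorted(good)

def estimate_reward(N=1024, T=8, k=16, trials=2000):
    """Monte Carlo estimate of r = E[ 1{x* in S_T} / |S_T| ]; x* = 0 here."""
    total = 0.0
    for _ in range(trials):
        S = list(range(N))
        for _ in range(T):
            S = one_one_oracle(S, k)
        total += (0 in S) / len(S)
    return total / trials

print(estimate_reward(), "vs baseline", 1 / 1024)  # roughly an order of magnitude higher
```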
3.3 EXTENSION TO MULTI-OBJECTIVE OPTIMIZATION

Given our understanding of space partitioning, we now extend this idea to MOO. Intuitively, we want good regions to always be picked by the space partition. For SOO this is possible since the optimum is a single point. How about MOO? Unlike SOO, in MOO we aim for a continuous region, the Pareto optimal set $\Omega_P := \{x : \nexists\, x' \neq x \text{ such that } x' \text{ dominates } x\}$. A key variable is the regularity of Ω_P: if it is highly non-regular and not captured by a simple partition boundary (ideally a parametric one), then learning a space partition would be difficult. Interestingly, the shape of Ω_P can be characterized for quadratic objectives:

Observation 1. If all $f_j$ are isotropic, $f_j(x) = \|x - c_j\|_2^2$, then $\Omega_P = \text{ConvexHull}(c_1, \ldots, c_M)$.

Observation 2. If M = 2 and $f_j(x) = (x - c_j)^\top H_j (x - c_j)$ where the $H_j$ are positive definite symmetric matrices, then there exist $w_1 := H_2(c_2 - c_1)$ and $w_2 := H_1(c_1 - c_2)$ such that for any $x \in \Omega_P$, $w_1^\top (x - c_1) \ge 0$ and $w_2^\top (x - c_2) \ge 0$.

In both cases, Ω_P can be separated from the non-Pareto region Ω \ Ω_P by a linear hyperplane. Empirically, Ω_P occupies only a small portion of the entire search space (Sec. 4), and quickly focusing samples on the promising regions is critical for high sample efficiency. In the general case, characterizing Ω_P analytically is hard and requires domain knowledge about the objectives (Li et al., 2014). However, for MOO algorithms in practice, knowing that Ω_P can be separated from Ω \ Ω_P via simple decision planes is already useful: we can learn such decision planes from the data collected so far, and sample further in the promising regions.
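Observation 1 is easy to verify numerically. The sketch below (our own illustration, assuming NumPy) draws random candidates for two isotropic quadratic objectives, computes each candidate's dominance count by brute force, and checks that the empirically non-dominated points lie close to the segment between the two centers, i.e., their convex hull.

```python
import numpy as np

rng = np.random.default_rng(0)
c1, c2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # objective centers
X = rng.uniform(-1.0, 2.0, size=(2000, 2))            # random candidates
F = np.stack([((X - c) ** 2).sum(axis=1) for c in (c1, c2)], axis=1)

# Brute-force dominance: dom[i, j] is True iff sample i dominates sample j.
dom = (F[:, None, :] <= F[None, :, :]).all(-1) & (F[:, None, :] < F[None, :, :]).any(-1)
o = dom.sum(axis=0)                                   # dominance counts

def dist_to_segment(p, a, b):
    t = np.clip(np.dot(p - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * (b - a))))

nd = X[o == 0]   # empirically non-dominated candidates
print(max(dist_to_segment(p, c1, c2) for p in nd))  # should be small: Omega_P ~ [c1, c2]
```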
4 LAMOO: LATENT ACTION MULTI-OBJECTIVE OPTIMIZATION

In Sec. 3, for convenience, we analyzed only a greedy approach, which makes decisions on space partitions and never revises them afterwards. While this greedy approach indeed works (as shown in Sec. 5.3), an early incorrect partition can easily rule out regions that turn out to be good but were not identified with few samples. In practice, we want to keep the decisions softer: while exploiting the promising regions, we also explore regions currently believed to be sub-optimal given limited samples. These regions may turn out to contain good solutions once more samples are available, and the oracle can then make a different partition. To balance this exploration-exploitation trade-off and cope with the generalization error of the learned classifier, we leverage Monte Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006) and propose our algorithm LaMOO. As shown in Alg. 1, LaMOO has four steps: (1) learn to partition the search space given the previously observed data points D_t, i.e., the pairs {x_i, f(x_i)} collected from iterations 0 to t; (2) with this information, partition each region into a promising and a non-promising sub-region, learning a classifier h(·) to separate them; (3) select the region to sample from, based on the UCB value of each node; (4) sample the selected region to obtain new data points D_{t+1}.

Learning Space Partitions. We construct the partition oracle using the dominance number. Let D_t be the samples collected up to iteration t and D_{t,j} := D_t ∩ Ω_j the samples within the region Ω_j we want to partition. For each sample x ∈ D_{t,j}, its dominance number o_{t,j}(x) at iteration t is defined as the number of samples in Ω_j that dominate x (here I[·] is the indicator function):

$$o_{t,j}(x) := \sum_{x_i \in D_{t,j}} \mathbb{I}[x \prec_f x_i,\; x \neq x_i] \tag{5}$$

While naive computation requires O(|D_{t,j}|²) operations, we use the Maxima Set algorithm (Kung et al., 1975), which runs in O(|D_{t,j}| log |D_{t,j}|). For x ∈ Ω_P, o(x) = 0.

Algorithm 1 LaMOO Pseudocode.
1: Inputs: initial D0 from uniform sampling, sample budget T.
2: for t = 0, ..., T do
3:   Set L ← {Ω_root} (collection of regions to be split).
4:   while L ≠ ∅ do
5:     Ω_j ← pop_first_element(L), D_{t,j} ← D_t ∩ Ω_j, n_{t,j} ← |D_{t,j}|.
6:     Compute the dominance numbers o_{t,j} of D_{t,j} using Eqn. 5 and train the SVM model h(·).
7:     If (D_{t,j}, o_{t,j}) is splittable by the SVM, then L ← L ∪ Partition(Ω_j, h(·)).
8:   end while
9:   for k = root; k is not a leaf node do
10:    D_{t,k} ← D_t ∩ Ω_k, v_{t,k} ← HyperVolume(D_{t,k}), n_{t,k} ← |D_{t,k}|.
11:    k ← argmax_{c ∈ children(k)} UCB_{t,c}, where UCB_{t,c} := v_{t,c} + 2·C_p·√(2 log(n_{t,k}) / n_{t,c}).
12:  end for
13:  D_{t+1} ← D_t ∪ D_new, where D_new is drawn from Ω_k based on qEHVI or CMA-ES.
14: end for

Figure 2: (a) Split: the leaf nodes D and E correspond to the non-splittable regions Ω_D and Ω_E of the search space, separated by learned h(·) boundaries into high-o(x) and low-o(x) sides. (b) Select: nodes are selected according to their UCB values. (c) Sample: new samples are generated only from the selected region Ω_E using the sampling method (e.g., Bayesian optimization).

For each D_{t,j}, we then obtain good (small o(x)) and bad (large o(x)) samples by ranking them according to o(x). The samples with the smallest 50% of values are labeled positive and the rest negative. Based on the labeled samples, a classifier (e.g., a Support Vector Machine (SVM)) is trained to learn a decision boundary as the latent action. We choose SVMs because the classifier needs to behave reasonably in regions with few samples, and because SVMs offer the flexibility of being parametric or non-parametric.

Exploration Using Upper Confidence Bounds (UCB). As shown in Fig. 2, LaMOO selects the final leaf node by always choosing the child node with the larger UCB value. The UCB value for a node j is defined as $UCB_j := v_j + 2C_p\sqrt{2\log n_{\text{parent}(j)} / n_j}$, where $n_j$ is the number of samples in node j, $C_p$ is a tunable hyperparameter controlling the degree of exploration, and $v_j$ is the hypervolume of the samples in node j. The selected leaf corresponds to the partitioned region Ω_k in Alg. 1.

Sampling in the Search Region. We use existing algorithms as the sampling strategy within a leaf node, e.g., qEHVI (Daulton et al., 2020) and CMA-ES (Igel et al., 2007a). LaMOO can therefore be regarded as a meta-algorithm, applicable on top of any existing SOO/MOO solver to boost its performance.

LaMOO with qEHVI. As a multi-objective solver, qEHVI finds data points that maximize a parallel version of the Expected Hypervolume Improvement (EHVI) via Bayesian optimization (BO). To incorporate qEHVI into LaMOO's sampling step, we confine qEHVI's search space using the tree-structured partition to better search for MOO solutions.

LaMOO with CMA-ES. CMA-ES is an evolutionary algorithm (EA) originally designed for single-objective optimization. As a leaf sampler, CMA-ES is used to pick samples that minimize the dominance number o(x) within the leaf. Since o(x) changes across iterations, at iteration t we first update o_{t'}(x) of all previous samples from t' < t to o_t(x), and then run CMA-ES. As in the qEHVI case, we constrain the search to remain within the leaf region. Once a set of new samples D_new is obtained (along with its objective values f(D_new)), we update all partitions along its path, and the entire procedure repeats.
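For concreteness, here is a minimal sketch of the partition step (our own illustration, assuming NumPy and scikit-learn; the paper's implementation details may differ): compute the dominance numbers of Eqn. 5 naively, label the better half positive, and fit the SVM latent action. The polynomial kernel with degree 4 follows the setup reported in Appendix G.

```python
import numpy as np
from sklearn.svm import SVC

def dominance_numbers(F):
    """Eqn. 5: for each sample (row of F, minimization), count how many other
    samples dominate it. Naive O(n^2) version; the Maxima Set algorithm
    (Kung et al., 1975) reduces this to O(n log n)."""
    dom = (F[:, None, :] <= F[None, :, :]).all(-1) & (F[:, None, :] < F[None, :, :]).any(-1)
    return dom.sum(axis=0)

def learn_partition(X, F):
    """One Split step: label the ~50% of samples with the smallest dominance
    numbers as good, and train an SVM h(.) as the latent action."""
    o = dominance_numbers(F)
    labels = (o <= np.median(o)).astype(int)      # small o(x) -> positive
    svm = SVC(kernel="poly", degree=4).fit(X, labels)
    return svm                                    # Omega_good = {x : svm.predict(x) == 1}
```

In the full algorithm, this split is applied recursively until a node is no longer splittable, producing the tree that MCTS then traverses.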
5 EXPERIMENTS

We evaluate the performance of LaMOO in a diverse set of scenarios, including synthetic functions and several real-world MOO problems such as neural architecture search, automobile safety design, and molecule discovery. In such real problems, several criteria often need to be optimized at the same time. For example, in molecule (drug) discovery, one wants the designed drug to be effective against the target disease, easy to synthesize, and non-toxic to the human body.

5.1 SMALL-SCALE PROBLEMS

Synthetic Functions. Branin-Currin (Belakaria et al., 2019) is a function with 2-dimensional input and 2 objectives. DTLZ2 (Deb et al., 2002b) is a classical scalable multi-objective problem, popularly used as a benchmark in the MOO community. We evaluate LaMOO and the baselines on DTLZ2 with 18 dimensions and 2 objectives, and with 12 dimensions and 10 objectives, respectively.

Structural Optimization in Automobile Safety Design (Vehicle Safety) is a real-world problem with 5-dimensional input and 3 objectives: (1) the mass of the vehicle, (2) the collision acceleration in a full-frontal crash, and (3) the toe-board intrusion (Liao et al., 2008).

Nasbench201 is a public benchmark for evaluating NAS algorithms (Dong & Yang, 2020). There are 15625 architectures in Nasbench201, with ground-truth #FLOPs and accuracy on CIFAR10 (Krizhevsky, 2009). Our goal is to minimize #FLOPs and maximize accuracy in this search space. We normalize #FLOPs to the range [−1, 0] and accuracy to [0, 1].

Figure 3: Left: Branin-Currin with 2 dimensions and 2 objectives. Middle: Vehicle Safety with 5 dimensions and 3 objectives. Right: Nasbench201 with 6 dimensions and 2 objectives. We ran each algorithm 7 times (the shaded area is the std of the mean). Top: Bayesian optimization with and without LaMOO. Bottom: evolutionary algorithms with and without LaMOO. Note that the two algorithm families show very different sample efficiency on MOO tasks.

We compare LaMOO with 4 classical evolutionary algorithms (CMA-ES (Igel et al., 2007a), MOEA/D (Zhang & Li, 2007), NSGA-II (Deb et al., 2002a), and NSGA-III (Deb & Jain, 2014)) and 2 state-of-the-art BO methods (qEHVI (Daulton et al., 2020) and qParEGO (Knowles, 2006)).

Evaluation Criterion. We first obtain the maximal hypervolume (either from the ground truth or estimated via massive sampling), then run each algorithm and compute the log hypervolume difference (Daulton et al., 2020):

$$HV_{\text{log\_diff}} := \log(HV_{\max} - HV_{\text{cur}}) \tag{6}$$

where $HV_{\text{cur}}$ is the hypervolume of the samples obtained by the algorithm under the given budget.
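The metric itself is a one-liner; for instance, using the maximal hypervolume of Branin-Currin from Table 3 in Appendix F.4 (a minimal sketch, our own illustration):

```python
import math

def hv_log_diff(hv_max, hv_cur):
    """Eqn. 6: log hypervolume difference (lower is better)."""
    return math.log(hv_max - hv_cur)

print(hv_log_diff(59.36011874867746, 55.0))  # Branin-Currin example: ~1.472
```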
Result. As shown in Fig. 3, LaMOO with qEHVI outperforms all our BO baselines and LaMOO with CMA-ES outperforms all our EA baselines in terms of HV_log_diff.

Figure 4: DTLZ2 with many objectives; we ran each algorithm 7 times (the shaded area is the std of the mean). From left to right: BO with 2 objectives; EA with 2 objectives; BO with 10 objectives; EA with 10 objectives.

Evolutionary algorithms rely on mutation and crossover of previous samples to generate new ones, and may get trapped in local optima. Thanks to MCTS, LaMOO also considers exploration and greatly improves upon vanilla CMA-ES across three different tasks, with 1000/200 samples in the small-scale/many-objective problems. In addition, by plugging in BO, LaMOO+qEHVI achieves 225% of the sample efficiency of the other BO algorithms on Nasbench201. This result indicates that for higher-dimensional problems (6 dimensions in Nasbench201), space partitioning leads to faster optimization. We further analyze very high-dimensional problems in Sec. 5.2 and visualize the Pareto frontier in Fig. 12.

Optimization of Many Objectives. While NSGA-II and NSGA-III perform well on the two-objective problems, all evolutionary baselines get stuck on the ten-objective problems. In contrast, LaMOO performs reasonably well: from Fig. 4, qEHVI+LaMOO shows strong performance with ten objectives. When combined with CMA-ES, LaMOO helps it escape the initial region and focus on a smaller promising region via space partitioning.

5.2 MULTI-OBJECTIVE MOLECULE DISCOVERY

Figure 5: Molecule discovery. Left: two objectives (GSK3β+JNK3). Middle: three objectives (QED+SA+SARS). Right: four objectives (GSK3β+JNK3+QED+SA). We ran each algorithm 15 times (the shaded area is the std of the mean).

Next, we tackle the practical problem of multi-objective molecular generation, a high-dimensional problem (the search space is 32-dimensional). Molecular generation models are a critical component of pharmaceutical drug discovery, wherein a cheap-to-run in silico model proposes promising molecules that can then be synthesized and tested in a lab (Vamathevan et al., 2019). However, one commonly requires the generated molecule to satisfy multiple constraints: for example, new drugs should generally be non-toxic and ideally easy to synthesize, in addition to serving their primary purpose. Therefore, in this work we consider several multi-objective molecular generation setups from prior work on molecular generation (Yu et al., 2019; Jin et al., 2020b; Yang et al., 2021): (1) activity against the biological targets GSK3β and JNK3; (2) the same targets together with QED (a standard measure of "drug-likeness") and SA (a standard measure of synthetic accessibility); and (3) activity against SARS together with QED and SA.
In each task, we propose samples from a pretrained 32-dimensional latent space from (Jin et al., 2020a), which are then decoded into molecular strings and fed into the property evaluators from prior work. Fig. 5 shows that LaMOO+qEHVI outperforms all baselines by up to 10% on various combinations of objectives. While EAs struggle to optimize these high-dimensional problems due to the limitations mentioned in Sec. 2, LaMOO helps them (e.g., CMA-ES) perform much better.

5.3 ABLATION STUDIES

Visualization of LaMOO. To understand how LaMOO works, we visualize its optimization procedure on Branin-Currin. First, the Pareto optimal set Ω_P is estimated from 10⁶ random samples (marked as black stars), shown in both the search and objective space (Fig. 6(b) and the bottom left of Fig. 6(d)). Over several iterations, LaMOO progressively prunes away unpromising regions so that the remaining regions approach Ω_P (Fig. 6(c)). Fig. 6(a) shows the final tree structure. The color of each leaf node corresponds to a region of the search space (Fig. 6(b)). The selected region is recursively bounded by the SVM classifiers corresponding to the nodes on the selected path (red arrows in Fig. 6(a)). New samples are generated only from the most promising region Ω_J, improving sample efficiency.

Figure 6: Visualization of the selected regions at different search iterations and nodes. (a) The Monte Carlo tree at the final search iteration, with colored leaves; the selected path is marked in red. (b) The regions (Ω_J, Ω_K, Ω_I, Ω_E, Ω_F, Ω_G) corresponding to the leaves in (a), together with the samples and SVM split boundaries in the search space. (c) The selected region along the selected path of the tree at the final iteration. (d) The samples in objective space at different search iterations; the bottom left shows the Pareto frontier estimated from one million samples.

Figure 7: Ablation studies on hyperparameters and sampling methods in LaMOO, on Vehicle Safety. Left: sampling with Bayesian optimization, CMA-ES, or random sampling. Middle: sampling with different C_p. Right: partitioning with different SVM kernels.

Ablation of Design Choices. We study how different hyperparameters and sampling methods affect performance, using Vehicle Safety.

Sampling methods. LaMOO can be integrated with different sampling methods, including Bayesian optimization (e.g., qEHVI) and evolutionary algorithms (e.g., CMA-ES). Fig. 7 (left) shows that, compared to random sampling, qEHVI improves performance substantially while CMA-ES improves it only slightly. This is consistent with our earlier finding that for MOO, BO is much more sample-efficient than EAs.

The exploration factor C_p controls the balance between exploration and exploitation. A larger C_p guides LaMOO to visit sub-optimal regions more often. Based on the results in Fig. 7 (middle), greedy search (C_p = 0) leads to worse performance than a proper C_p value (i.e., 10% of the maximum hypervolume), which justifies our use of MCTS. On the other hand, over-exploration can yield results even worse than greedy search. Therefore, a rule of thumb is to set C_p to roughly 10% of the maximum hypervolume HV_max; when HV_max is unknown, C_p can be set empirically (see Appendix D).
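The selection rule is compact enough to state in code; the sketch below (our own illustration, with a hypothetical Node type) walks the tree by the UCB score from Alg. 1, with C_p trading off exploitation (the hypervolume term v_j) against exploration (the visit-count term).

```python
from dataclasses import dataclass, field
from typing import List, Optional
import math

@dataclass
class Node:
    hv: float                          # hypervolume of the samples in this region
    n: int                             # number of samples in this region
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def ucb(node: Node, cp: float) -> float:
    """UCB_j = v_j + 2*Cp*sqrt(2*ln(n_parent) / n_j) from Alg. 1."""
    return node.hv + 2 * cp * math.sqrt(2 * math.log(node.parent.n) / node.n)

def select_leaf(root: Node, cp: float) -> Node:
    """Follow the child with the largest UCB until reaching a leaf; cp = 0
    recovers the greedy variant, larger cp explores sub-optimal regions more."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: ucb(c, cp))
    return node
```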
SVM kernels. As shown in Fig. 7 (right), the RBF kernel performs best, in agreement with (Wang et al., 2020). Thanks to the non-linearity of the polynomial and RBF kernels, their region partitions perform better than a linear one.

6 CONCLUSION

We propose a search-space-partition optimizer called LaMOO, a meta-algorithm that extends prior single-objective works (Wang et al., 2020; Yang et al., 2021) to multi-objective optimization. We demonstrated, both theoretically and via experiments on multiple MOO tasks, that LaMOO significantly improves search performance over strong baselines like qEHVI and CMA-ES.

7 ACKNOWLEDGEMENT

This work was supported in part by NSF Grants #1815619 and #2105564, a VMWare grant, and computational resources supported by the Academic & Research Computing group at Worcester Polytechnic Institute.

REFERENCES

Sanghamitra Bandyopadhyay, Sankar K Pal, and B Aruna. Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 34(5):2088-2099, 2004.

Syrine Belakaria, Aryan Deshwal, and Janardhan Rao Doppa. Max-value entropy search for multi-objective bayesian optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/82edc5c9e21035674d481640448049f3-Paper.pdf.

Nicola Beume and Günter Rudolph. Faster S-metric calculation by considering dominated hypervolume as Klee's measure problem, 2006.

Nicola Beume, Boris Naujoks, and Michael Emmerich. SMS-EMOA: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3):1653-1669, 2007. ISSN 0377-2217. doi: 10.1016/j.ejor.2006.08.008. URL https://www.sciencedirect.com/science/article/pii/S0377221706005443.

Nicola Beume, Carlos M. Fonseca, Manuel López-Ibáñez, Luís Paquete, and Jan Vahrenhold. On the complexity of computing the hypervolume indicator. IEEE Transactions on Evolutionary Computation, 13(5):1075-1082, 2009. doi: 10.1109/TEVC.2009.2015575.

Peter A. N. Bosman and Dirk Thierens. The balance between proximity and diversity in multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(2):174-188, 2003.

Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019. URL https://arxiv.org/pdf/1812.00332.pdf.

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once for all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020. URL https://arxiv.org/pdf/1908.09791.pdf.

Kuang-Hua Chang. Chapter 5 - Multiobjective optimization and advanced topics. In Kuang-Hua Chang (ed.), Design Theory and Methods Using CAD/CAE, pp. 325-406. Academic Press, Boston, 2015. ISBN 978-0-12-398512-5. doi: 10.1016/B978-0-12-398512-5.00005-0. URL https://www.sciencedirect.com/science/article/pii/B9780123985125000050.

Ivo Couckuyt, Dirk Deschrijver, and Tom Dhaene.
Fast calculation of multiobjective probability of improvement and expected improvement criteria for Pareto optimization. Journal of Global Optimization, 60(3):575-594, 2014.

Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable expected hypervolume improvement for parallel multi-objective bayesian optimization. arXiv preprint arXiv:2006.05078, 2020.

J. L. Pérez De la Cruz and L. Mandow. A new approach to multiobjective A* search. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, pp. 218-223, 2005.

K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182-197, 2002a. doi: 10.1109/4235.996017.

K. Deb, L. Thiele, M. Laumanns, and E. Zitzler. Scalable multi-objective optimization test problems. In Proceedings of the 2002 Congress on Evolutionary Computation, CEC'02, volume 1, pp. 825-830, 2002b. doi: 10.1109/CEC.2002.1007032.

Kalyanmoy Deb and Himanshu Jain. An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: Solving problems with box constraints. IEEE Transactions on Evolutionary Computation, 18(4):577-601, 2014. doi: 10.1109/TEVC.2013.2281535.

Guoqiang Deng, Zhangcan Huang, and Min Tang. Research in the performance assessment of multi-objective optimization evolutionary algorithms. In 2007 International Conference on Communications, Circuits and Systems, pp. 915-918. IEEE, 2007.

Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=HJxyZkBKDr.

Xi Fang, Wenwen Wang, Lang He, Zhangcan Huang, Yang Liu, and Liang Zhang. Research on improved NSGA-II algorithm and its application in emergency management, 2018. doi: 10.1155/2018/1306341.

M. Fleischer. The measure of Pareto optima: Applications to multi-objective metaheuristics. In Carlos M. Fonseca, Peter J. Fleming, Eckart Zitzler, Lothar Thiele, and Kalyanmoy Deb (eds.), Evolutionary Multi-Criterion Optimization, pp. 519-533, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

Nyoman Gunantara. A review of multi-objective optimization: Methods and its applications. Cogent Engineering, 5(1):1502242, 2018. doi: 10.1080/23311916.2018.1502242.

Tatsunori Hashimoto, Steve Yadlowsky, and John Duchi. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, pp. 2027-2036. PMLR, 2018.

Oliver Hinder. Cutting plane methods can be extended into nonconvex optimization. In Conference On Learning Theory, pp. 1451-1454. PMLR, 2018.

Christian Igel, Nikolaus Hansen, and Stefan Roth. Covariance matrix adaptation for multi-objective optimization. Evolutionary Computation, 15(1):1-28, 2007. doi: 10.1162/evco.2007.15.1.1.

Christian Igel, Thorsten Suttorp, and Nikolaus Hansen. Steady-state selection and efficient covariance matrix update in the multi-objective CMA-ES.
In Shigeru Obayashi, Kalyanmoy Deb, Carlo Poloni, Tomoyuki Hiroyasu, and Tadahiko Murata (eds.), Evolutionary Multi-Criterion Optimization, pp. 171-185, Berlin, Heidelberg, 2007a. Springer Berlin Heidelberg.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, pp. 4839-4848. PMLR, 2020a.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Multi-objective molecule generation using interpretable substructures. In International Conference on Machine Learning, pp. 4849-4859. PMLR, 2020b.

Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization with exponential convergence. 2015.

Beomjoon Kim, Kyungjae Lee, Sungbin Lim, Leslie Kaelbling, and Tomás Lozano-Pérez. Monte Carlo tree search in continuous spaces using Voronoi optimistic optimization with regret bounds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 9916-9924, 2020.

Taehyeon Kim, Jaeyeon Ahn, Nakyil Kim, and Seyoung Yun. Adaptive local bayesian optimization over multiple discrete variables. URL https://valohaichirpprod.blob.core.windows.net/papers/kaist_osi.pdf.

J. Knowles. ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10(1):50-66, 2006. doi: 10.1109/TEVC.2005.851274.

Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou (eds.), Machine Learning: ECML 2006, pp. 282-293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-46056-5.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Manoj Kumar, George E. Dahl, Vijay Vasudevan, and Mohammad Norouzi. Parallel architecture and hyperparameter search via successive halving and classification. CoRR, abs/1805.10255, 2018. URL http://arxiv.org/abs/1805.10255.

H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. J. ACM, 22(4):469-476, October 1975. ISSN 0004-5411. doi: 10.1145/321906.321910. URL https://doi.org/10.1145/321906.321910.

Miqing Li, Shengxiang Yang, and Xiaohui Liu. Diversity comparison of Pareto front approximations in many-objective optimization. IEEE Transactions on Cybernetics, 44(12):2568-2584, 2014.

Xingtao Liao, Qing Li, Xujing Yang, Weigang Zhang, and Wei Li. Multiobjective optimization for crash safety design of vehicles using stepwise regression model. Structural and Multidisciplinary Optimization, 35(6):561-569, 2008.

GV Loganathan and Hanif D Sherali. A convergent interactive cutting-plane algorithm for multiobjective optimization. Operations Research, 35(3):365-377, 1987.

Ilya Loshchilov, Marc Schoenauer, and Michèle Sebag. A mono surrogate for multiobjective optimization. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, GECCO '10, pp. 471-478, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300728. doi: 10.1145/1830483.1830571. URL https://doi.org/10.1145/1830483.1830571.

Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q.
Weinberger (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011a. URL https://proceedings.neurips.cc/paper/2011/file/7e889fb76e0e07c11733550f2a6c7a5a-Paper.pdf.

Rémi Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. Advances in Neural Information Processing Systems, 24:783-791, 2011b.

Linqiang Pan, Cheng He, Ye Tian, Handing Wang, Xingyi Zhang, and Yaochu Jin. A classification-based surrogate-assisted evolutionary algorithm for expensive many-objective optimization. IEEE Transactions on Evolutionary Computation, 23(1):74-88, 2019. doi: 10.1109/TEVC.2018.2802784.

Lie Meng Pang, Hisao Ishibuchi, and Ke Shang. NSGA-II with simple modification works well on a wide variety of many-objective problems. IEEE Access, 8:190240-190250, 2020. doi: 10.1109/ACCESS.2020.3032240.

Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A flexible multi-objective bayesian optimization approach using random scalarizations. CoRR, abs/1805.12168, 2018. URL http://arxiv.org/abs/1805.12168.

Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjective optimization on a limited budget of evaluations using model-assisted S-metric selection. In Günter Rudolph, Thomas Jansen, Nicola Beume, Simon Lucas, and Carlo Poloni (eds.), Parallel Problem Solving from Nature PPSN X, pp. 784-794, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

Mikita Sazanovich, Anastasiya Nikolskaya, Yury Belousov, and Aleksei Shpilman. Solving black-box optimization challenge via learning search space partition for local bayesian optimization. URL https://valohaichirpprod.blob.core.windows.net/papers/jetbrains.pdf.

Chun-Wei Seah, Yew-Soon Ong, Ivor W. Tsang, and Siwei Jiang. Pareto rank learning in multi-objective evolutionary algorithms. In 2012 IEEE Congress on Evolutionary Computation, pp. 1-8, 2012. doi: 10.1109/CEC.2012.6252865.

Nidamarthi Srinivas and Kalyanmoy Deb. Muiltiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation, 2(3):221-248, 1994.

Bradley S. Stewart and Chelsea C. White. Multiobjective A*. J. ACM, 38(4):775-814, October 1991. ISSN 0004-5411. doi: 10.1145/115234.115368. URL https://doi.org/10.1145/115234.115368.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

Chi Tung Tung and Kim Lin Chew. A multicriteria Pareto-optimal path algorithm. European Journal of Operational Research, 62(2):203-209, 1992. ISSN 0377-2217. doi: 10.1016/0377-2217(92)90248-8. URL https://www.sciencedirect.com/science/article/pii/0377221792902488.

Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6):463-477, 2019.

David A Van Veldhuizen and Gary B Lamont. Evolutionary computation and convergence to a Pareto front. In Late Breaking Papers at the Genetic Programming 1998 Conference, pp. 221-228. Citeseer, 1998.

Douglas AG Vieira and Adriano Chaves Lisboa. A cutting-plane method to nonsmooth multiobjective optimization problems. European Journal of Operational Research, 275(3):822-829, 2019.

Linnan Wang, Saining Xie, Teng Li, Rodrigo Fonseca, and Yuandong Tian.
Sample-efficient neural architecture search by learning action space. CoRR, abs/1906.06832, 2019. URL http://arxiv.org/abs/1906.06832.

Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for black-box optimization using monte carlo tree search. arXiv preprint arXiv:2007.00708, 2020.

Ziyu Wang, Babak Shakibi, Lin Jin, and Nando de Freitas. Bayesian multi-scale optimistic optimization. In Artificial Intelligence and Statistics, pp. 1005-1014. PMLR, 2014.

Kaifeng Yang, Michael Emmerich, André Deutz, and Thomas Bäck. Multi-objective bayesian global optimization using expected hypervolume improvement gradient. Swarm and Evolutionary Computation, 44:945-956, 2019. ISSN 2210-6502. doi: 10.1016/j.swevo.2018.10.007. URL https://www.sciencedirect.com/science/article/pii/S2210650217307861.

Kevin Yang, Tianjun Zhang, Chris Cummins, Brandon Cui, Benoit Steiner, Linnan Wang, Joseph E. Gonzalez, Dan Klein, and Yuandong Tian. Learning space partitions for path planning. CoRR, abs/2106.10544, 2021. URL https://arxiv.org/abs/2106.10544.

Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. arXiv preprint arXiv:1902.08142, 2019.

Qingfu Zhang and Hui Li. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712-731, 2007. doi: 10.1109/TEVC.2007.892759.

E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. da Fonseca. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117-132, 2003. doi: 10.1109/TEVC.2003.810758.

Eckart Zitzler, Kalyanmoy Deb, and Lothar Thiele. Comparison of multiobjective evolutionary algorithms: Empirical results. Evolutionary Computation, 8(2):173-195, 2000.

A PROOFS

Lemma 1. The algorithm that uniformly draws k samples from S, picks the better half, and returns it is a (1, 1)-oracle.

Proof. Consider the following simple (1, 1)-oracle for single-objective optimization: after drawing k samples, we rank them according to their function values and split them into two subsets S_good and S_bad of k/2 samples each; all other points are randomly assigned to one of the two subsets. If x* happens to be among the k collected samples (which happens with probability at least k/|S|), then certainly x* ∈ S_good. Therefore:

$$P(x^* \in S_{\text{good}} \mid x^* \in S) \ge \frac{k}{|S|} \ge 1 - \exp\left(-\frac{k}{|S|}\right)$$

which is an oracle with α = η = 1. The last inequality follows from $e^x \ge 1 + x$ (and thus $e^{-x} \ge 1 - x$).

Lemma 2. Define $g(\lambda): \mathbb{R}_+ \to \mathbb{R}_+$ as

$$g(\lambda) := \sum_{t=1}^{T} w_t \log\left(1 + \frac{1}{\lambda w_t}\right)$$

The following maximization problem

$$\max_{\{z_t\}} \; \sum_{t=1}^{T} \log\left(1 - e^{-z_t}\right) \quad \text{s.t.} \quad \sum_{t=1}^{T} w_t z_t = K \tag{9}$$

has optimal solutions

$$z_t^* = \log\left(1 + \frac{1}{\lambda w_t}\right), \quad 1 \le t \le T \tag{10}$$

where λ is determined by g(λ) = K. With the optimal {z*_t}, the objective reaches $-\sum_t \log(1 + \lambda w_t)$.

Proof. The Lagrangian is:

$$J = \sum_{t=1}^{T} \log\left(1 - e^{-z_t}\right) - \lambda\left(\sum_{t=1}^{T} w_t z_t - K\right)$$

Taking the derivative with respect to $z_t$:

$$\frac{\partial J}{\partial z_t} = \frac{e^{-z_t}}{1 - e^{-z_t}} - \lambda w_t = 0$$

which gives

$$\frac{1}{1 - e^{-z_t}} = 1 + \lambda w_t \;\Longrightarrow\; e^{-z_t} = \frac{\lambda w_t}{1 + \lambda w_t} \;\Longrightarrow\; z_t = \log\frac{1 + \lambda w_t}{\lambda w_t} = \log\left(1 + \frac{1}{\lambda w_t}\right)$$

Substituting back, $\log(1 - e^{-z_t^*}) = \log\frac{1}{1 + \lambda w_t}$, so the objective reaches $-\sum_t \log(1 + \lambda w_t)$.

Lemma 3. Both g(λ) and g⁻¹(y) are monotonically decreasing.
Furthermore, let $\bar w := \left(\frac{1}{T}\sum_{t=1}^{T} w_t^{-1}\right)^{-1}$ be the harmonic mean of $\{w_t\}$ and $w_{\max} := \max_{t} w_t$. Then:

$$\frac{1}{\bar w\left(e^{\bar w^{-1} y/T} - 1\right)} \;\le\; g^{-1}(y) \;\le\; \frac{1}{w_{\max}\left(e^{w_{\max}^{-1} y/T} - 1\right)} \;\le\; \frac{T}{y}$$

Figure 8: Upper and lower bounds of g⁻¹(y) with different {w_i}. Left: w_i = 2^linspace(−0.1, 10). Right: w_i = 2^linspace(2, 5). A small span of {w_i} leads to tighter bounds.

Proof. It is easy to see that as λ increases, each term in g(λ) decreases, so g(λ) is a decreasing function of λ; therefore its inverse mapping g⁻¹(y) is also decreasing. Let $\mu(y) := 1/g^{-1}(y) > 0$. Then we have:

$$y = \sum_{t=1}^{T} w_t \log\left(1 + \frac{\mu(y)}{w_t}\right)$$

It is clear that µ(0) = 0. Taking the derivative with respect to y on both sides:

$$1 = \mu'(y) \sum_{t=1}^{T} \frac{1}{1 + \mu(y)/w_t}$$

where $\mu'(y) = d\mu(y)/dy$. Using the property of the harmonic mean (by convexity of $x \mapsto 1/(1 + \mu x)$):

$$\sum_{t=1}^{T} \frac{1}{1 + \mu(y)/w_t} \ge \frac{T}{1 + \mu(y)/\bar w}$$

This gives:

$$\mu'(y) \le \frac{1 + \mu(y)/\bar w}{T}$$

Integrating both sides from 0 and using µ(0) = 0, we obtain $\bar w \log(1 + \mu(y)/\bar w) \le y/T$, which leads to $\mu(y) \le \bar w\left(e^{y \bar w^{-1}/T} - 1\right)$. With g⁻¹(y) = 1/µ(y), we arrive at the lower bound for g⁻¹(y). For the upper bound, note that $w_t \le w_{\max}$ implies

$$\sum_{t=1}^{T} \frac{1}{1 + \mu(y)/w_t} \le \frac{T}{1 + \mu(y)/w_{\max}}$$

Using the same technique as above, we obtain $\mu(y) \ge w_{\max}\left(e^{y w_{\max}^{-1}/T} - 1\right)$, and the upper bound of g⁻¹(y) follows. Finally, since $e^x \ge 1 + x$, we have

$$\frac{1}{w_{\max}\left(e^{w_{\max}^{-1} y/T} - 1\right)} \le \frac{1}{w_{\max} \cdot w_{\max}^{-1} y/T} = \frac{T}{y}$$

Theorem 1. Following the optimal action sequence, the algorithm yields a reward r lower bounded by the following:

$$r \ge r_b \exp\left[T\left(\log 2 - \frac{\eta N^{\alpha}\,\varphi(\alpha, T)}{K}\right)\right]$$

where $r_b := N^{-1}$ and $\varphi(\alpha, T) := (1 - 2^{-\alpha T})/(1 - 2^{-\alpha})$.

Proof. First note that $|S_T| \le |S_0|/2^T$ and thus $1/|S_T| \ge 2^T/N$. So we just need to bound $P(x^* \in S_T)$, which can be written as:

$$P(x^* \in S_T) = \prod_{t=1}^{T} P(x^* \in S_t \mid x^* \in S_{t-1}) \ge \prod_{t=1}^{T}\left[1 - \exp\left(-\frac{k_t}{\eta |S_{t-1}|^{\alpha}}\right)\right]$$

Therefore we have

$$\log P(x^* \in S_T) \ge \sum_{t=1}^{T} \log\left[1 - \exp\left(-\frac{k_t}{\eta |S_{t-1}|^{\alpha}}\right)\right]$$

We want to find the action sequence {k_t} that maximizes $\log P(x^* \in S_T)$. Let $w_t := \eta |S_{t-1}|^{\alpha}$ and $z_t := k_t/w_t$; applying Lemma 2, we know that

$$\max_{\{k_t\}} \log P(x^* \in S_T) \ge -\sum_{t=1}^{T} \log(1 + \lambda w_t) \tag{25}$$

where the Lagrange multiplier λ satisfies g(λ) = K. Now we have:

$$\sum_{t=1}^{T} \log(1 + \lambda w_t) \overset{(1)}{\le} \sum_{t=1}^{T} \log\left(1 + \frac{T}{K} w_t\right) \overset{(2)}{\le} \sum_{t=1}^{T} \log\left(1 + \frac{T}{K}\,\eta\left(\frac{N}{2^{t-1}}\right)^{\alpha}\right) \overset{(3)}{\le} \frac{\eta T N^{\alpha}}{K} \sum_{t=1}^{T} 2^{-\alpha(t-1)} = \varphi(\alpha, T)\,\frac{\eta T N^{\alpha}}{K} \tag{28}$$

Here (1) is due to Lemma 3, which tells us that $\lambda = g^{-1}(K) \le T/K$; (2) is due to $w_t := \eta |S_{t-1}|^{\alpha}$ and $|S_{t-1}| \le N/2^{t-1}$; and (3) is due to $\log(1 + x) \le x$. Putting all of this together:

$$r \ge \max_{\{k_t\}} \frac{1}{|S_T|} P(x^* \in S_T) \ge \frac{2^T}{N} \exp\left(-\varphi(\alpha, T)\,\frac{\eta T N^{\alpha}}{K}\right)$$

Optimal action sequence {k*_t}. From the proof, we can also write down the optimal action sequence that achieves the best reward: $k_t^* = w_t \log\left(1 + \frac{1}{\lambda w_t}\right)$, where $w_t := \eta |S_{t-1}|^{\alpha}$. Using Lemma 3, we can compute upper and lower bound estimates of $\lambda = g^{-1}(K)$. With $\bar w$ the harmonic mean of $\{w_t\}$ and $w_{\max} := \max_t w_t$:

$$\frac{1}{\bar w\left(e^{\bar w^{-1} K/T} - 1\right)} \;\le\; \lambda \;\le\; \frac{1}{w_{\max}\left(e^{w_{\max}^{-1} K/T} - 1\right)} \tag{31}$$

With λ, we can compute an approximate {k*_t}. Here we make a rough estimate of {k*_t} for the case where the algorithm terminates while $|S_T|$ is still fairly large. This corresponds to the setting $T = \beta \log_2 N$ with β < 1 and all $w_t \approx \eta N^{\alpha}$. With $K = \Theta(N^{\alpha})$ as in the semi-parametric case, $\bar w^{-1} K = \Theta(1)$, so $e^{\bar w^{-1} K/T} - 1 \approx \bar w^{-1} K/T$ and $\lambda w_t \approx T = \beta \log_2 N \gg 1$. Since $\log(1 + x) \approx x$ for small x, we have $k_t^* \approx w_t \cdot \frac{1}{\lambda w_t} = 1/\lambda$, which is independent of t. Therefore, a constant amount of sampling at each stage is approximately optimal.
Observation 1. If all $f_j$ are isotropic, $f_j(x) = \|x - c_j\|_2^2$, then $\Omega_P = \text{ConvexHull}(c_1, \ldots, c_M)$.

Proof. Consider $J(x; \mu) := \sum_{j=1}^{M} \mu_j f_j(x)$ where the weights $\mu_j \ge 0$ satisfy $\sum_j \mu_j = 1$. For brevity, write the constraint set as $\Delta := \{\mu : \mu_j \ge 0, \sum_j \mu_j = 1\}$. Now consider the Pareto set $\Omega_P = \{x : \exists \mu \in \Delta : \nabla_x J(x; \mu) = 0\}$. We have:

$$\nabla_x J(x; \mu) = 0 \;\Longleftrightarrow\; \sum_j \mu_j \nabla f_j(x) = 0 \;\Longleftrightarrow\; 2\sum_j \mu_j (x - c_j) = 0 \;\Longleftrightarrow\; x = \sum_j \mu_j c_j$$

where the last step uses $\sum_j \mu_j = 1$. Therefore, any $x \in \Omega_P$ is a convex combination of $\{c_1, \ldots, c_M\}$ and thus $x \in \text{ConvexHull}(c_1, \ldots, c_M)$. Conversely, for any $x \in \text{ConvexHull}(c_1, \ldots, c_M)$, there exists $\mu \in \Delta$ with $\nabla_x J(x; \mu) = 0$, and thus $x \in \Omega_P$.

Observation 2. If M = 2 and $f_j(x) = (x - c_j)^\top H_j (x - c_j)$ where the $H_j$ are positive definite symmetric matrices, then there exist $w_1 := H_2(c_2 - c_1)$ and $w_2 := H_1(c_1 - c_2)$ such that for any $x \in \Omega_P$, $w_1^\top (x - c_1) \ge 0$ and $w_2^\top (x - c_2) \ge 0$.

Proof. Following Observation 1, we similarly have, for all $x \in \Omega_P$, $\sum_j \mu_j H_j (x - c_j) = 0$, which gives:

$$x = \left(\sum_j \mu_j H_j\right)^{-1} \sum_j \mu_j H_j c_j \tag{36}$$

Note that this is an expression for the Pareto set $\Omega_P$. Let $\mathcal{M} := \sum_i \mu_i H_i$; since $\mu \in \Delta$, $\mathcal{M}$ is positive definite. Let $A_j := \mathcal{M}^{-1} \mu_j H_j$, so $\sum_j A_j = I$; note that while $\mathcal{M}$ and $\mathcal{M}^{-1}$ are positive definite, the $A_j$ may not be. Now observe that

$$\sum_j \mu_j H_j c_j = \sum_{j \ne k} \mu_j H_j (c_j - c_k) + \mathcal{M} c_k$$

Using Eqn. 36, $x = \mathcal{M}^{-1} \sum_j \mu_j H_j c_j = c_k + \mathcal{M}^{-1} \sum_{j \ne k} \mu_j H_j (c_j - c_k)$. For M = 2, we have $x = c_2 + \mathcal{M}^{-1} \mu_1 H_1 (c_1 - c_2)$. So:

$$(c_1 - c_2)^\top H_1 x = (c_1 - c_2)^\top H_1 c_2 + \mu_1 (c_1 - c_2)^\top H_1 \mathcal{M}^{-1} H_1 (c_1 - c_2) \ge (c_1 - c_2)^\top H_1 c_2$$

because $H_1 \mathcal{M}^{-1} H_1$ is a PSD matrix. Therefore, letting $w_2 := H_1(c_1 - c_2)$, we have $w_2^\top (x - c_2) \ge 0$, independent of $\mu \in \Delta$; this means it holds for any $x \in \Omega_P$. Similarly, letting $w_1 := H_2(c_2 - c_1)$, we have $w_1^\top (x - c_1) \ge 0$ for all $x \in \Omega_P$.

B QUALITY INDICATORS COMPARISON

Table 1: A review of different quality indicators (the property assignments for MS, S, and ONVGR follow the descriptions below).

| Quality Indicator | Convergence | Uniformity | Spread | No reference set required |
|---|---|---|---|---|
| Hypervolume | ✓ | ✓ | ✓ | ✓ |
| GD (Van Veldhuizen & Lamont, 1998) | ✓ | | | |
| IGD (Bosman & Thierens, 2003) | ✓ | ✓ | ✓ | |
| MS (Zitzler et al., 2000) | | | ✓ | ✓ |
| S (Bandyopadhyay et al., 2004) | | ✓ | | ✓ |
| ONVGR | ✓ | | | |

Generational Distance (GD) (Van Veldhuizen & Lamont, 1998) measures the distance between the Pareto frontier of the sampled approximation and the true Pareto frontier; it requires prior knowledge of the true Pareto frontier, and only convergence is considered. Inverted Generational Distance (IGD) (Bosman & Thierens, 2003) is an improved version of GD: it calculates the distance from each point on the true Pareto frontier to the closest point on the Pareto frontier of the current samples. IGD satisfies all three evaluation metrics but requires the true Pareto frontier, which is hard to obtain in real-world problems. Maximum Spread (MS) (Zitzler et al., 2000) computes the distance between the two farthest samples to evaluate spread. Spacing (S) (Bandyopadhyay et al., 2004) measures how close the distribution of the samples' Pareto frontier is to a uniform distribution. Overall Non-dominated Vector Generation and Ratio (ONVGR) is the ratio of the number of samples on the true Pareto frontier. Table 1 summarizes the characteristics of each quality indicator.

C END-TO-END LAMOO PSEUDOCODE

Below we list the pseudocode for the end-to-end workflow of LaMOO in Algorithm 2. Specifically, it includes search-space partitioning in Function Split, node (promising region) selection in Function Select, and new-sample generation in Function Sample.

Algorithm 2 LaMOO Pseudocode.
1: Inputs: initial D0 from uniform sampling, sample budget T.
2: for t = 0, ..., T do
3:   Set L ← {Ω_root} (collection of regions to be split).
4:   V, v, n ← Split(L, D_t)
5:   k ← Select(C_p, D_t)
6:   D_{t+1} ← Sample(k)
7: end for
8:
9: Function Split(L, D_t)
10:  while L ≠ ∅ do
11:    Ω_j ← pop_first_element(L), D_{t,j} ← D_t ∩ Ω_j, n_{t,j} ← |D_{t,j}|.
12:    Compute the dominance numbers o_{t,j} of D_{t,j} using Eqn. 5 and train the SVM model h(·).
13:    If (D_{t,j}, o_{t,j}) is splittable by the SVM, then L ← L ∪ Partition(Ω_j, h(·)).
14:  end while
15:
16: Function Select(C_p, D_t)
17:  for k = root; k is not a leaf node do
18:    D_{t,k} ← D_t ∩ Ω_k, v_{t,k} ← HyperVolume(D_{t,k}), n_{t,k} ← |D_{t,k}|.
19:    k ← argmax_{c ∈ children(k)} UCB_{t,c}, where UCB_{t,c} := v_{t,c} + 2·C_p·√(2 log(n_{t,k}) / n_{t,c})
20:  end for
21:  return k
22:
23: Function Sample(k)
24:  D_new is drawn from Ω_k based on qEHVI or CMA-ES.
25:  return D_t ∪ D_new
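To connect the pseudocode to something executable, below is a compact, greedy one-level sketch of the Split/Select/Sample loop (our own simplification, assuming NumPy and scikit-learn): it splits the samples once per iteration with an SVM trained on dominance numbers and always samples the good side by rejection sampling, a stand-in for the qEHVI/CMA-ES inner solvers. The full method instead recurses into a tree and selects a path with MCTS/UCB as above; Sec. 5.3 shows even the greedy variant works.

```python
import numpy as np
from sklearn.svm import SVC

def dominance_numbers(F):
    """Eqn. 5 (naive O(n^2)): count, per row of F, how many rows dominate it."""
    dom = (F[:, None, :] <= F[None, :, :]).all(-1) & (F[:, None, :] < F[None, :, :]).any(-1)
    return dom.sum(axis=0)

def lamoo_greedy(f, dim, n_init=10, iters=50, q=5, seed=0):
    """f maps an (n, dim) batch in [0, 1]^dim to an (n, M) matrix of objectives
    to be minimized; returns all evaluated points and their objective values."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_init, dim))
    F = f(X)
    for _ in range(iters):
        # Split: label the better half by dominance number and fit the SVM.
        o = dominance_numbers(F)
        y = (o <= np.median(o)).astype(int)
        svm = SVC(kernel="poly", degree=4).fit(X, y) if len(set(y)) > 1 else None
        # Sample: guarded rejection sampling inside the predicted good region.
        new = np.empty((0, dim))
        for _ in range(100):
            cand = rng.uniform(size=(64, dim))
            if svm is not None:
                cand = cand[svm.predict(cand) == 1]
            new = np.vstack([new, cand])[:q]
            if len(new) == q:
                break
        if len(new) == 0:                       # fall back to uniform sampling
            new = rng.uniform(size=(q, dim))
        X, F = np.vstack([X, new]), np.vstack([F, f(new)])
    return X, F
```

For example, `lamoo_greedy(lambda X: np.stack([((X - 0.25)**2).sum(1), ((X - 0.75)**2).sum(1)], axis=1), dim=2)` optimizes two toy quadratic objectives.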
D EXPLORATION FACTOR (Cp) SETUP WITH UNKNOWN MAXIMUM HYPERVOLUME

Figure 9: Sampling with a static Cp (10% of HVmax) and a dynamic Cp (10% of HVcurrent) on (a) Branin-Currin, (b) Vehicle Safety, and (c) Nasbench201 (y-axis: log hypervolume difference).

As mentioned in the paper, a "rule of thumb" is to set Cp to roughly 10% of the maximum hypervolume HVmax. If HVmax is unknown, Cp can instead be set dynamically to 10% of the hypervolume of the current samples at each search iteration. Figure 9 compares 10% of HVmax with 10% of the current hypervolume on three problems (Branin-Currin, Vehicle Safety, and Nasbench201, from left to right). The final performance of the two settings is similar.

E WALL CLOCK TIME IN DIFFERENT PROBLEMS

Figure 10: Wall clock time in different problems; each panel plots log hypervolume difference against average cumulative wall clock time (s) for qEHVI+LaMOO, qEHVI, qParEGO, CMA-ES+LaMOO, CMA-ES, NSGA-II, NSGA-III, and MOEA/D on (a) Branin-Currin, (b) Vehicle Safety, and (c) Nasbench201.

Fig. 10 shows the wall clock time of the different search algorithms on Branin-Currin (Belakaria et al., 2019), Vehicle Safety (Liao et al., 2008), and Nasbench201 (Dong & Yang, 2020).

F DETAILS OF BENCHMARK PROBLEMS

F.1 PROBLEM DESCRIPTION

Branin-Currin (Belakaria et al., 2019):

$$f^{(1)}(x_1, x_2) = \Big(15x_2 - \frac{5.1(15x_1 - 5)^2}{4\pi^2} + \frac{75x_1 - 25}{\pi} - 5\Big)^2 + \Big(10 - \frac{10}{8\pi}\Big)\cos(15x_1 - 5)$$

$$f^{(2)}(x_1, x_2) = \Big[1 - \exp\Big(-\frac{1}{2x_2}\Big)\Big]\,\frac{2300x_1^3 + 1900x_1^2 + 2092x_1 + 60}{100x_1^3 + 500x_1^2 + 4x_1 + 20}$$

where $x_1, x_2 \in [0, 1]$.

Vehicle Safety (Liao et al., 2008):

$$f_1(x) = 1640.2823 + 2.3573285x_1 + 2.3220035x_2 + 4.5688768x_3 + 7.7213633x_4 + 4.4559504x_5$$

$$f_2(x) = 6.5856 + 1.15x_1 - 1.0427x_2 + 0.9738x_3 + 0.8364x_4 - 0.3695x_1x_4 + 0.0861x_1x_5 + 0.3628x_2x_4 + 0.1106x_1^2 - 0.3437x_3^2 + 0.1764x_4^2$$

$$f_3(x) = -0.0551 + 0.0181x_1 + 0.1024x_2 + 0.0421x_3 - 0.0073x_1x_2 + 0.024x_2x_3 - 0.0118x_2x_4 - 0.0204x_3x_4 - 0.008x_3x_5 - 0.0241x_2^2 + 0.0109x_4^2$$

where $x \in [1, 3]^5$.
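For reference, the two Branin-Currin objectives above transcribe directly into code. The sketch below is a minimal stand-alone version (our transcription; BoTorch ships an equivalent BraninCurrin test problem), assuming $x_2 > 0$ so the Currin term is well defined.

import numpy as np

def branin_currin(x1: float, x2: float):
    """Inputs in [0, 1] (x2 > 0); returns (f1, f2) as defined above."""
    u = 15.0 * x1 - 5.0  # Branin input rescaled from [0, 1]
    f1 = (15.0 * x2 - 5.1 * u ** 2 / (4.0 * np.pi ** 2)
          + (75.0 * x1 - 25.0) / np.pi - 5.0) ** 2 \
         + (10.0 - 10.0 / (8.0 * np.pi)) * np.cos(u)
    # Currin exponential function.
    f2 = (1.0 - np.exp(-1.0 / (2.0 * x2))) * \
         (2300.0 * x1 ** 3 + 1900.0 * x1 ** 2 + 2092.0 * x1 + 60.0) / \
         (100.0 * x1 ** 3 + 500.0 * x1 ** 2 + 4.0 * x1 + 20.0)
    return f1, f2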
Nasbench201 (Dong & Yang, 2020):

Figure 11: A general architecture of Nasbench201; cell edges choose among operations such as skip-connect, 1x1 convolution, 3x3 convolution, and 3x3 average pooling.

In Nasbench201, architectures are built by stacking cells; architectures differ only in the design of the cell (see Fig. 11). Specifically, each cell contains 4 nodes, and each pair of nodes is connected by an edge carrying one of five operations: zeroize, skip-connect, 1x1 convolution, 3x3 convolution, or 3x3 average pooling. Therefore, there are $\binom{4}{2} = 6$ edges in a cell and $5^6 = 15625$ unique architectures in Nasbench201. Accordingly, each architecture can be encoded as a 6-dimensional vector with 5 discrete values per dimension (i.e., 0, 1, 2, 3, 4, corresponding to zeroize, skip-connect, 1x1 convolution, 3x3 convolution, and 3x3 average pooling).

$$f_1(x) = \text{Accuracy}(x), \qquad f_2(x) = \#\text{FLOPs}(x)$$

where $x \in \{0, 1, 2, 3, 4\}^6$.

DTLZ2 (Deb et al., 2002b):

$$f_1(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2}x_1\big) \cdots \cos\big(\tfrac{\pi}{2}x_{M-2}\big)\cos\big(\tfrac{\pi}{2}x_{M-1}\big)$$
$$f_2(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2}x_1\big) \cdots \cos\big(\tfrac{\pi}{2}x_{M-2}\big)\sin\big(\tfrac{\pi}{2}x_{M-1}\big)$$
$$f_3(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2}x_1\big) \cdots \sin\big(\tfrac{\pi}{2}x_{M-2}\big)$$
$$\vdots$$
$$f_M(x) = (1 + g(x_M)) \sin\big(\tfrac{\pi}{2}x_1\big)$$

where $g(x) = \sum_{x_i \in x_M} (x_i - 0.5)^2$, $x \in [0, 1]^d$, and $x_M$ denotes the last $d - M + 1$ elements of $x$.

F.2 VISUALIZATION OF PARETO FRONTIER FOR BENCHMARK PROBLEMS

Figure 12: Visualization of the Pareto frontier (samples vs. Pareto frontier) on (a) Branin-Currin, (b) Vehicle Safety, and (c) Nasbench201.

F.3 REFERENCE POINTS

Table 2: The reference points for all problems in this work; M represents the number of objectives.

Problem | Reference Point
Branin-Currin | (18.0, 6.0)
Vehicle Safety | (1864.72022, 11.81993945, 0.2903999384)
Nasbench201 | (−3.0, −6.0)
DTLZ2 | (1.1, ..., 1.1) ∈ R^M
Molecule Discovery | (0.0, ..., 0.0) ∈ R^M

The reference point $R \in \mathbb{R}^M$ is required to measure the hypervolume of a problem, and a different reference point results in a different hypervolume; see Sec. 1 for details. Table 2 lists the reference points for the problems used throughout the paper.

F.4 MAXIMUM HYPERVOLUME OF EACH PROBLEM

Table 3: The maximum hypervolume for all problems in this work.

Problem | Maximum Hypervolume
Branin-Currin | 59.36011874867746
Vehicle Safety | 246.81607081187002
Nasbench201 | 8.06987476348877
DTLZ2 (2 objectives) | 1.4460165933151778
DTLZ2 (10 objectives) | 2.5912520655298095
Molecule Discovery | N/A

Table 3 lists the observed maximum hypervolume for the problems used throughout the paper. We use these values to calculate the log hypervolume difference in Fig. 3 and Fig. 4.
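As a concrete reading of this metric, the sketch below computes the two-objective hypervolume with a standard O(N log N) staircase sweep (maximization convention: the reference point lies componentwise below all Pareto points, as in Sec. 1) and then the log hypervolume difference plotted in Fig. 3 and Fig. 4. The function names are ours, and the log base is an assumption; for M > 2 objectives, a general tool such as BoTorch's hypervolume utilities would be used instead.

import math

def hypervolume_2d(points, ref):
    """Hypervolume of the region dominated by `points` and bounded below by
    `ref`, for 2 maximized objectives. O(N log N) staircase sweep."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0])
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in reversed(pts):      # descending f1
        if f2 > best_f2:              # non-dominated point: add its new strip
            hv += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return hv

def log_hv_diff(points, ref, hv_max):
    # Gap to the maximum hypervolume (Table 3); base-10 log chosen here.
    return math.log10(hv_max - hypervolume_2d(points, ref))

# Toy usage: two non-dominated points, reference (0, 0), known HV_max = 1.0.
print(log_hv_diff([(0.9, 0.3), (0.4, 0.8)], ref=(0.0, 0.0), hv_max=1.0))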
G EXPERIMENTS SETUP

Experiment details: For the small-scale problems (i.e., Branin-Currin, Vehicle Safety, and Nasbench201) and DTLZ2 with 2 and 10 objectives, we randomly generate 10 samples as initialization. For multi-objective molecule discovery, the number of initial samples is 150. In each iteration, we add a batch of 5 new samples (the q value) for all search algorithms.

Hyperparameters of LaMOO: For all problems, we use a polynomial kernel for the SVM, with the degree of the polynomial kernel function set to 4. The minimum number of samples in an MCTS leaf is 10. Cp is set to roughly 10% of the maximum hypervolume (i.e., Branin-Currin → 5, Vehicle Safety → 20, Nasbench201 → 6, DTLZ2 (2 objectives) → 0.1, DTLZ2 (10 objectives) → 0.25, molecule discovery (2 objectives) → 0.03, molecule discovery (3 objectives) → 0.2, molecule discovery (4 objectives) → 0.06).

Hyperparameters of qEHVI and qParEGO: q is set to 5. The acquisition function is optimized with L-BFGS-B (with a maximum of 200 iterations). In each iteration, 256 raw samples are generated by the initialization heuristic for selection by the acquisition function. The original work (Daulton et al., 2020) used 1024 raw samples; we decrease this number to 256 to match the sample budget across all methods for comparison, which speeds up the search but may lead to lower performance, e.g., on the Vehicle Safety problem in Fig. 3. As in Daulton et al. (2020), each objective is modeled with an independent Gaussian process with a Matern 5/2 ARD kernel.

H VERIFICATION OF LAMOO ON MANY-OBJECTIVE PROBLEMS

Figure 13: Dominance number distribution of 50 random samples on DTLZ2 (10 objectives).

Figure 14: The range of hypervolume for 50 samples randomly generated from different regions (good region, bad region, selected region, and the whole space) in DTLZ2 (10 objectives). We generate 25 batches of 50 samples in total.

In many-objective problems, it is theoretically hard to label samples as "good" or "bad" based on their dominance number, due to the lack of dominance pressure (all samples are non-dominated with respect to each other). However, if the number of objectives is not too large (i.e., M ≤ 10), samples can still be split by dominance number. For the problem shown in Fig. 4 (DTLZ2 with 10 objectives), we randomly generate 50 samples in the search space and plot their dominance number distribution (see Fig. 13); we repeated this experiment 5 times. We then partition the search space into "good" and "bad" regions with an SVM classifier trained on the labeled samples, and randomly generate 50 samples in the good region, the bad region, and the whole space, respectively; this process is repeated 5 times with different initial samples. Fig. 14 shows the range of hypervolume of the samples generated from the good regions, the whole space, and the bad regions. From the figure, we can see that the hypervolume of samples generated from the good regions is significantly higher than that of the others.

I COMPUTATIONAL COMPLEXITY ANALYSIS OF LAMOO

Here is a detailed breakdown of the computational complexity of our algorithm (Alg. 1):

Line 6: computing dominance numbers costs $O(M N_{node}^2)$, where $N_{node}$ is the number of samples in the node and M is the number of objectives.

Line 7: training the SVM costs $O(N_{node}^2)$, where $N_{node}$ is the number of samples in the node.

Line 10: computing the hypervolume costs $O(N^{M/2} + N \log N)$ for $M > 3$ (Beume & Rudolph, 2006) or $O(N \log N)$ for $M \le 3$ (Beume et al., 2009), where N is the total number of searched samples.

Total time complexity: $\sum_{i=1}^t O(N_i^2)$ for $M \le 3$, and $\sum_{i=1}^t O(N^{M/2} + N_i \log N_i)$ for $M > 3$, where t is the total number of nodes. When there are more than 3 objectives (M > 3), the hypervolume computation is the dominant factor; when $M \le 3$, the optimization cost of the SVM is the dominant factor.

J VARIATION OF LAMOO WITH A CHEAPER OVERHEAD

Algorithm 3 LaMOO pseudocode with leaf-based selection.
1: Inputs: Initial D0 from uniform sampling, sample budget T.
2: for t = 0, . . . , T do
3:   Set L ← {Ωroot} (collection of regions to be split).
4:   while L ≠ ∅ do
5:     Ωj ← pop_first_element(L), Dt,j ← Dt ∩ Ωj, nt,j ← |Dt,j|.
6:     Compute dominance number ot,j of Dt,j using Eqn. 5 and train SVM model h(·).
7:     If (Dt,j, ot,j) is splittable by SVM, then L ← L ∪ Partition(Ωj, h(·)).
8:   end while
9:   for k = root; k is not a leaf node do
10:     Dt,k ← Dt ∩ Ωk, nt,k ← |Dt,k|.
11:   end for
12:   for each leaf node l do
13:     vt,l ← HyperVolume(Dt,l)
14:   end for
15:   k ← arg max_{l ∈ leaf nodes} UCBt,l, where UCBt,l := vt,l + 2Cp·sqrt(2 log(nt,p)/nt,l) and p is the parent of l.
16:   Dt+1 ← Dt ∪ Dnew, where Dnew is drawn from Ωk based on qEHVI or CMA-ES.
17: end for
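Line 6 of Algorithms 2 and 3 computes the dominance number of every sample in a node. A direct implementation matching the $O(M N_{node}^2)$ cost stated in Appendix I is sketched below (the function name is ours; minimization convention of Eqn. 1, under which smaller $o(x)$ means closer to the Pareto frontier among existing samples).

import numpy as np

def dominance_numbers(F: np.ndarray) -> np.ndarray:
    """F: (N, M) array of objective values, all objectives minimized.
    Returns o with o[i] = number of samples dominating sample i; O(M*N^2)."""
    N = F.shape[0]
    o = np.zeros(N, dtype=int)
    for i in range(N):
        # y dominates x iff y <= x in every objective and y < x in at least one.
        leq = np.all(F <= F[i], axis=1)
        lt = np.any(F < F[i], axis=1)
        o[i] = np.count_nonzero(leq & lt)
    return o

# Samples with o == 0 form the non-dominated front among the current samples;
# the dominance numbers provide the labels on which the SVM split is trained.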
Figure 15: Search progress with respect to the number of samples (log hypervolume difference) for LaMOO vs. LaMOO with leaf selection on (a) Branin-Currin, (b) Vehicle Safety, and (c) Nasbench201.

Figure 16: Search progress with respect to average cumulative wall clock time (s) for qEHVI+LaMOO vs. LaMOO with leaf selection on (a) Branin-Currin, (b) Vehicle Safety, and (c) Nasbench201.

Instead of traversing down the search tree to trace the currently most promising search path, this variation of LaMOO directly selects the leaf node with the highest UCB value; Algorithm 3 illustrates the details. This variation therefore avoids calculating the hypervolume in the non-leaf nodes of the tree, and hypervolume calculation is the main computational cost of LaMOO, especially in many-objective problems. Figure 15 and Figure 16 validate that this variation reaches performance similar to the original on the searched samples while saving a substantial amount of time. We leave validation on other problems to future work.

K ADDITIONAL RELATED WORKS

Hashimoto et al. (2018a), Kumar et al. (2018), and Hashimoto et al. (2018b) also leverage classifiers to partition the search space and draw new samples from the good region. However, they sample randomly within the selected regions without integrating existing optimizers (e.g., Bayesian optimization or evolutionary algorithms). In addition, they progressively select the good regions without trading off exploration and exploitation, as we do by leveraging Monte Carlo Tree Search (MCTS). Munos (2011a), Wang et al. (2014), and Kawaguchi et al. (2015) can be seen as the first works to use MCTS to build hierarchical space partitions, but their partitions are predefined (e.g., Voronoi graphs or axis-aligned partitions) and do not learn from (or adapt to) the samples observed so far; the exceptions are Wang et al. (2019; 2020) and Yang et al. (2021), which are learning extensions coupled with MCTS. However, they all deal with single-objective optimization. For multi-objective optimization, Loshchilov et al. (2010), Seah et al. (2012), and Pan et al. (2019) learn to predict the dominance rank of samples, rather than computing it algorithmically (a slow process when there are many previous samples), in order to speed up MOEAs. Unlike our paper, they do not partition the search space into good/bad regions; LaMOO, in contrast, computes the rank algorithmically. Our contributions are therefore complementary to theirs, and we leave a combination of the two as future work.

LaMOO vs. LaMCTS/LaNAS: First, the mechanism for partitioning the search space is different. LaMOO uses the dominance rank to separate good regions from bad ones, while LaMCTS uses k-means for region separation. LaNAS is simpler still: it uses the median of the single objective over the currently collected samples, together with a linear classifier, to separate regions.

LaMOO vs. LaP3: LaP3 is a planning algorithm tailored to RL with a single objective function.
LaP3 also utilizes representation learning for the partition space and the planning space, while our LaMOO does not.